arkouda#
Submodules#
arkouda.accessor, arkouda.alignment, arkouda.array_view, arkouda.categorical, arkouda.client, arkouda.client_dtypes, arkouda.dataframe, arkouda.dtypes, arkouda.groupbyclass, arkouda.history, arkouda.index, arkouda.infoclass, arkouda.io, arkouda.io_util, arkouda.join, arkouda.logger, arkouda.match, arkouda.matcher, arkouda.numeric, arkouda.pdarrayclass, arkouda.pdarraycreation, arkouda.pdarraysetops, arkouda.plotting, arkouda.row, arkouda.security, arkouda.segarray, arkouda.series, arkouda.sorting, arkouda.strings, arkouda.timeclass, arkouda.util
Package Contents#
Classes#
- A multi-dimensional view of a pdarray. Arkouda ArrayView behaves similarly to numpy's ndarray.
- Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.
- The basic arkouda array class. This class contains only the …
- Represents an array of strings whose data resides on the …
- Represent integers as bit vectors, e.g. a set of flags.
- An integer-backed representation of a set of named binary fields, e.g. flags.
- Represent integers as IPv4 addresses.
- Generic enumeration.
- Represents an array of values belonging to named categories. Converting a …
- Represents a date and/or time.
- Represents a duration, the difference between two dates or times.
- A DataFrame structure based on arkouda arrays.
- This class is useful for printing and working with individual rows of a …
- One-dimensional arkouda array with axis labels.
- Custom property-like object.
Functions#
- Broadcast a dense column vector to the rows of a sparse matrix or grouped array.
- Cast an array to another dtype.
- Returns an array with elements chosen from A and B based upon a …
- arange([start,] stop[, stride,] dtype=int64)
- Convert a Python or Numpy Iterable to a pdarray or Strings object, sending …
- Return a pdarray instance pointing to an array created by the arkouda server.
- Create a pdarray filled with zeros.
- Make a callback (i.e. function) that can be called on an …
- Convert values to an Arkouda array of IP addresses.
- Indicate which values are ipv4 when passed data containing IPv4 and IPv6 values.
- Indicate which values are ipv6 when passed data containing IPv4 and IPv6 values.
- Assert that numpy dtype dt is one of the dtypes supported …
- Split numpy dtype dt into its kind and byte size, raising …
- Try to infer what dtype arkouda_server should treat val as.
- Get a concrete byteorder (turns '=' into '<' or '>').
- Get the server's byteorder.
- Send a clear message to clear all unregistered data from the server symbol table.
- Return True iff any element of the array evaluates to True.
- Return True iff all elements of the array evaluate to True.
- Return True iff the array is monotonically non-decreasing.
- Return the sum of all elements in the array.
- Return the product of all elements in the array. Return value is …
- Return the minimum value of the array.
- Return the maximum value of the array.
- Return the index of the first occurrence of the array min value.
- Return the index of the first occurrence of the array max value.
- Return the mean of the array.
- Return the variance of values in the array.
- Return the standard deviation of values in the array. The standard …
- Find the k minimum values of an array.
- Find the k maximum values of an array.
- Find the indices corresponding to the k minimum values of an array.
- Find the indices corresponding to the k maximum values of an array.
- Find the population (number of bits set) for each integer in an array.
- Find the bit parity (XOR of all bits) for each integer in an array.
- Count leading zeros for each integer in an array.
- Count trailing zeros for each integer in an array.
- Rotate bits of <x> to the left by <rot>.
- Rotate bits of <x> to the right by <rot>.
- Return the covariance of x and y.
- Return the correlation between x and y.
- Take the square root of an array. If where is given, the operation will only take place in …
- Raise an array to a power. If where is given, the operation will only take place in the positions …
- Return the element-wise remainder of division.
- Class method to return a pdarray attached to the registered name in the arkouda …
- Unregister a named pdarray in the arkouda server which was previously …
- Test whether each element of a 1-D array is also present in a second array.
- Concatenate a list or tuple of …
- Find the union of two arrays/List of Arrays.
- Find the intersection of two arrays.
- Find the set difference of two arrays.
- Find the set exclusive-or (symmetric difference) of two arrays.
- Create a pdarray filled with ones.
- Create a pdarray filled with fill_value.
- Create a zero-filled pdarray of the same size and dtype as an existing …
- Create a one-filled pdarray of the same size and dtype as an existing …
- Create a pdarray filled with fill_value of the same size and dtype as an existing …
- Create a pdarray of linearly-spaced floats in a closed interval.
- Generate a pdarray of randomized int, float, or bool values in a …
- Generate a pdarray with uniformly distributed random float values …
- Draw real numbers from the standard normal distribution.
- Generate random strings with lengths uniformly distributed between …
- Generate random strings with log-normally distributed lengths and …
- Convert a Pandas Series to an Arkouda pdarray or Strings object. If …
- Create a bigint pdarray from an iterable of uint pdarrays.
- Return the element-wise absolute value of the array.
- Return the element-wise natural log of the array.
- Return the element-wise exponential of the array.
- Return the cumulative sum over the array.
- Return the cumulative product over the array.
- Return the element-wise sine of the array.
- Return the element-wise cosine of the array.
- Return the element-wise tangent of the array.
- Return the element-wise inverse sine of the array. The result is between -pi/2 and pi/2.
- Return the element-wise inverse cosine of the array. The result is between 0 and pi.
- Return the element-wise inverse tangent of the array. The result is between -pi/2 and pi/2.
- Return the element-wise inverse tangent of the array pair. The result chosen is the …
- Return the element-wise hyperbolic sine of the array.
- Return the element-wise hyperbolic cosine of the array.
- Return the element-wise hyperbolic tangent of the array.
- Return the element-wise inverse hyperbolic sine of the array.
- Return the element-wise inverse hyperbolic cosine of the array.
- Return the element-wise inverse hyperbolic tangent of the array.
- Convert angles element-wise from radians to degrees.
- Convert angles element-wise from degrees to radians.
- Return an element-wise hash of the array or list of arrays.
- Compute a histogram of evenly spaced bins over the range of an array.
- Count the occurrences of the unique values of an array.
- Test a pdarray for Not a number / NaN values.
- Find the unique elements of an array.
- Perform an inner-join on equality between two integer arrays where …
- Generate a segmented array of variable-length, contiguous ranges between pairs of …
- Compute the internal size of a hypothetical join between a and b. Returns …
- Enable verbose logging (DEBUG log level) for all ArkoudaLoggers.
- Disable verbose logging (DEBUG log level) for all ArkoudaLoggers, setting …
- Allow the user to write custom logs.
- Create a fixed frequency Datetime range. Alias for …
- Return a fixed frequency TimedeltaIndex, with day as the default …
- Return a JSON formatted string containing information about the objects in names.
- Return a list containing the names of all registered objects.
- Return a list containing the names of all objects in the symbol table.
- Print verbose information for each object in names in a human readable format.
- A convenience method for instantiating an ArkoudaLogger that retrieves the …
- Alias for the from_parts function. Prevents user from needing to call the ak.SegArray constructor …
- Analogous to other python 'sorted(obj)' functions in that it returns …
- Find the intersection of two arkouda arrays.
- Find the inverse of a permutation array.
- Find all the rows that are in both dataframes. Columns should be in …
- Utilizes the ak.join.inner_join function to return an ak …
- Utilizes the ak.join.inner_join_merge function to return an …
- Utilizes the ak.join.inner_join_merge and the ak.join.right_join_merge …
- Convert a Categorical array to Strings for display.
- Map an array of sparse values to 0-up indices.
- Map multiple arrays of sparse identifiers to a common 0-up index.
- Map two arrays of sparse values to the 0-up index set implied by the right array, …
- Map two arrays of sparse identifiers to the 0-up index set implied by the left array, …
- Return indices of query items in a search list of items (-1 if not found).
- Apply the function defined by the mapping keys --> values to arguments.
- Test each value for membership in any of a set of half-open (pythonic) …
- Given an array of query vals and non-overlapping, closed intervals, return …
- Return True iff the arrays are cosorted, i.e., if the arrays were columns in a table …
- Apply a function defined over intervals to an array of arguments.
- Compute the sample skewness of an array.
- Plot the distribution and cumulative distribution of histogram data.
- Create a grid plot histogramming all numeric columns in an ak dataframe.
- Get the type of a file accessible to the server. Supported …
- Call the h5ls utility on an HDF5 file visible to the …
- Used for identifying the datasets within a file when a CSV does not …
- Get null indices of a string column in a Parquet file.
- Get the names of the datasets in the provided files.
- Get a list of column names from CSV file(s).
- Read Arkouda objects from HDF5 file(s).
- Read Arkouda objects from Parquet file(s).
- Read CSV file(s) into Arkouda objects. If more than one dataset is found, the objects …
- Read datasets from files.
- Read datasets from files and tag each record to the file it was read from.
- Import data from a file saved by Pandas (HDF5/Parquet) to an Arkouda object and/or …
- Export data from an Arkouda file (Parquet/HDF5) to a Pandas object or file formatted to be …
- Save multiple named pdarrays to HDF5 files.
- Save multiple named pdarrays to Parquet files.
- Write Arkouda object(s) to CSV file(s). All CSV files written by Arkouda …
- DEPRECATED
- Load a pdarray previously saved with …
- Load multiple pdarrays, Strings, SegArrays, or Categoricals previously …
- Overwrite the datasets with name appearing in names or keys in columns if columns …
- Create a snapshot of the current Arkouda namespace. All currently accessible variables containing …
- Return data saved using ak.snapshot.
- Receive a pdarray sent by pdarray.transfer().
- Receive a pdarray sent by dataframe.transfer().
- Attach to all objects registered with the names provided.
- Unregister all names provided.
- Register all objects in the provided dictionary.
- Determine if the name provided is associated with a registered Object.
Attributes#
The DType enum defines the supported Arkouda data types in string form. |
- class arkouda.ArrayView(base: arkouda.pdarrayclass.pdarray, shape, order='row_major')#
A multi-dimensional view of a pdarray. Arkouda ArrayView behaves similarly to numpy's ndarray. The base pdarray is stored in one dimension but can be indexed and treated logically as if it were multi-dimensional.
- dtype#
The element type of the base pdarray (equivalent to base.dtype)
- Type:
dtype
- size#
The number of elements in the base pdarray (equivalent to base.size)
- Type:
int_scalars
- ndim#
Number of dimensions (equivalent to shape.size)
- Type:
int_scalars
- itemsize#
The size in bytes of each element (equivalent to base.itemsize)
- Type:
int_scalars
- order#
Index order to read and write the elements. By default or if ‘C’/’row_major’, read and write data in row_major order If ‘F’/’column_major’, read and write data in column_major order
- Type:
str {‘C’/’row_major’ | ‘F’/’column_major’}
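The row_major vs column_major distinction can be sketched in plain Python, without a server. The `flat_index` helper below is purely illustrative (not part of the arkouda API); it shows how a multi-dimensional index maps onto a position in the 1-D base array under each order.

```python
# Illustrative sketch: map a multi-dimensional index to a flat offset.
# flat_index is a hypothetical helper, not an arkouda function.

def flat_index(idx, shape, order="row_major"):
    """Return the position of idx in the flat base array."""
    if order in ("C", "row_major"):
        # Last axis varies fastest (Horner's scheme over the shape).
        offset = 0
        for i, n in zip(idx, shape):
            offset = offset * n + i
        return offset
    elif order in ("F", "column_major"):
        # First axis varies fastest: same scheme on the reversed axes.
        offset = 0
        for i, n in zip(reversed(idx), reversed(shape)):
            offset = offset * n + i
        return offset
    raise ValueError(f"unknown order: {order}")

base = list(range(6))  # flat 1-D storage for a 2x3 view
print(base[flat_index((0, 1), (2, 3), "row_major")])     # -> 1
print(base[flat_index((0, 1), (2, 3), "column_major")])  # -> 2
```

The same logical index (0, 1) reads different elements of the base array depending on the order, which is why `order` must match between writers and readers of a view.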
- objType = 'ArrayView'#
- to_ndarray() numpy.ndarray#
Convert the ArrayView to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the ArrayView size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same attributes and data as the ArrayView
- Return type:
np.ndarray
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the ArrayView size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
Examples
>>> a = ak.arange(6).reshape(2,3)
>>> a.to_ndarray()
array([[0, 1, 2],
       [3, 4, 5]])
>>> type(a.to_ndarray())
numpy.ndarray
- to_list() list#
Convert the ArrayView to a list, transferring array data from the Arkouda server to client-side Python. Note: if the ArrayView size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A list with the same data as the ArrayView
- Return type:
list
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the ArrayView size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.arange(6).reshape(2,3)
>>> a.to_list()
[[0, 1, 2], [3, 4, 5]]
>>> type(a.to_list())
list
- to_hdf(prefix_path: str, dataset: str = 'ArrayView', mode: str = 'truncate', file_type: str = 'distribute')#
Save the current ArrayView object to an HDF5 file.
- Parameters:
prefix_path (str) – Path to the file to write the dataset to
dataset (str) – Name of the dataset to write
mode (str (truncate | append)) – Default: truncate Mode to write the dataset in. Truncate will overwrite any existing files. Append will add the dataset to an existing file.
file_type (str (single|distribute)) – Default: distribute Indicates the format to save the file. Single will store in a single file. Distribute will store the data in a file per locale.
- update_hdf(prefix_path: str, dataset: str = 'ArrayView', repack: bool = True)#
Overwrite the dataset with the name provided with this array view object. If the dataset does not exist it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the array view
Notes
If the file does not contain a File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
Because HDF5 deletes do not release memory, this will create a copy of the file with the new data
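The Notes above reference the _LOCALE#### suffix used when a dataset is distributed across one file per locale. A minimal sketch of that naming pattern, where the four-digit zero-padding is an assumption for illustration and `locale_filenames` is a hypothetical helper, not an arkouda function:

```python
# Hypothetical helper illustrating the per-locale file naming pattern
# mentioned in the Notes. The :04d padding width is an assumption.

def locale_filenames(prefix_path, num_locales):
    return [f"{prefix_path}_LOCALE{i:04d}" for i in range(num_locales)]

print(locale_filenames("/data/myview", 2))
# -> ['/data/myview_LOCALE0000', '/data/myview_LOCALE0001']
```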
- arkouda.bitType#
- arkouda.intTypes#
- arkouda.isSupportedInt(num)#
- arkouda.akuint64#
- class arkouda.GroupBy(keys: groupable | None = None, assume_sorted: bool = False, **kwargs)#
Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.
- Parameters:
keys ((list of) pdarray, Strings, or Categorical) – The array to group by value, or if list, the column arrays to group by row
assume_sorted (bool) – If True, assume keys is already sorted (Default: False)
- nkeys#
The number of key arrays (columns)
- Type:
int
- size#
The length of the input array(s), i.e. number of rows
- Type:
int
- unique_keys#
The unique values of the keys array(s), in grouped order
- Type:
(list of) pdarray, Strings, or Categorical
- ngroups#
The length of the unique_keys array(s), i.e. number of groups
- Type:
int
- logger#
Used for all logging operations
- Type:
ArkoudaLogger
- Raises:
TypeError – Raised if keys is a pdarray with a dtype other than int64
Notes
Integral pdarrays, Strings, and Categoricals are natively supported, but float64 and bool arrays are not.
For a user-defined class to be groupable, it must inherit from pdarray and define or overload the grouping API:
a ._get_grouping_keys() method that returns a list of pdarrays that can be (co)argsorted.
(Optional) a .group() method that returns the permutation that groups the array
If the input is a single array with a .group() method defined, method 2 will be used; otherwise, method 1 will be used.
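The grouping semantics described above (unique keys returned in grouped order, with one aggregate per key) can be sketched in plain Python without an arkouda server. `group_aggregate` is a hypothetical helper for illustration only, and it assumes grouped order means sorted key order:

```python
# Illustrative sketch of GroupBy semantics: collect values by key,
# then reduce each group. Not arkouda code; runs client-side only.

def group_aggregate(keys, values, op):
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    unique_keys = sorted(groups)  # grouped (here: sorted) key order
    return unique_keys, [op(groups[k]) for k in unique_keys]

keys = [3, 2, 3, 1, 2, 4, 3, 4, 3, 4]
vals = [1] * len(keys)
uk, counts = group_aggregate(keys, vals, sum)
print(uk)      # [1, 2, 3, 4]
print(counts)  # [1, 2, 4, 3]
```

The real GroupBy computes a grouping permutation on the server and never materializes per-group Python lists, but the input/output contract is the same shape as this sketch.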
- Reductions#
- objType = 'GroupBy'#
- static from_return_msg(rep_msg)#
- to_hdf(prefix_path, dataset='groupby', mode='truncate', file_type='distribute')#
Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: "distribute" When set to single, dataset is written to a single file. When distribute, dataset is written to a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
- Returns:
None
GroupBy is not currently supported by Parquet
- update_hdf(prefix_path: str, dataset: str = 'groupby', repack: bool = True)#
- size() Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Count the number of elements in each group, i.e. the number of times each key appears.
- Parameters:
none –
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
counts (pdarray, int64) – The number of times each unique key appears
See also
Notes
This alias for “count” was added to conform to the Pandas API.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
- count() Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Count the number of elements in each group, i.e. the number of times each key appears.
- Parameters:
none –
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
counts (pdarray, int64) – The number of times each unique key appears
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
- aggregate(values: groupable, operator: str, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, groupable]#
Using the permutation stored in the GroupBy instance, group another array of values and apply a reduction to each group’s values.
- Parameters:
values (pdarray) – The values to group and reduce
operator (str) – The name of the reduction operator to use
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
unique_keys (groupable) – The unique keys, in grouped order
aggregates (groupable) – One aggregate value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if the requested operator is not supported for the values dtype
Examples
>>> keys = ak.arange(0, 10)
>>> vals = ak.linspace(-1, 1, 10)
>>> g = ak.GroupBy(keys)
>>> g.aggregate(vals, 'sum')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777768, -0.55555555555555536, -0.33333333333333348, -0.11111111111111116, 0.11111111111111116, 0.33333333333333348, 0.55555555555555536, 0.77777777777777768, 1]))
>>> g.aggregate(vals, 'min')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777779, -0.55555555555555558, -0.33333333333333337, -0.11111111111111116, 0.11111111111111116, 0.33333333333333326, 0.55555555555555536, 0.77777777777777768, 1]))
- sum(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and sum each group’s values.
- Parameters:
values (pdarray) – The values to group and sum
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_sums (pdarray) – One sum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The grouped sum of a boolean pdarray returns integers.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.sum(b)
(array([2, 3, 4]), array([8, 14, 6]))
- prod(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the product of each group’s values.
- Parameters:
values (pdarray) – The values to group and multiply
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_products (pdarray, float64) – One product per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if prod is not supported for the values dtype
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.prod(b)
(array([2, 3, 4]), array([12, 108.00000000000003, 8.9999999999999982]))
- var(values: arkouda.pdarrayclass.pdarray, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the variance of each group’s values.
- Parameters:
values (pdarray) – The values to group and find variance
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2).
The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.var(b)
(array([2 3 4]), array([2.333333333333333 1.2 0]))
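The ddof behavior described in the Notes can be checked client-side with Python's statistics module: statistics.variance is the ddof=1 (unbiased) estimator and statistics.pvariance is the ddof=0 (population) estimator. The values below are the key-3 group from the example above (the elements of b where a == 3):

```python
# Plain-Python check of the ddof semantics from the Notes; no arkouda needed.
import statistics

x = [3.0, 3.0, 4.0, 1.0, 3.0]       # values of b grouped under key 3
m = sum(x) / len(x)
ss = sum((v - m) ** 2 for v in x)   # sum of squared deviations

print(ss / (len(x) - 1))            # ddof=1 divisor N-1: ~1.2, matches g.var(b) for key 3
print(statistics.variance(x))       # same ddof=1 estimator
print(statistics.pvariance(x))      # ddof=0 divisor N: ~0.96
```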
- std(values: arkouda.pdarrayclass.pdarray, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the standard deviation of each group’s values.
- Parameters:
values (pdarray) – The values to group and find standard deviation
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)).
The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.std(b)
(array([2 3 4]), array([1.5275252316519465 1.0954451150103321 0]))
- mean(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the mean of each group’s values.
- Parameters:
values (pdarray) – The values to group and average
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_means (pdarray, float64) – One mean value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.mean(b)
(array([2, 3, 4]), array([2.6666666666666665, 2.7999999999999998, 3]))
- median(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the median of each group’s values.
- Parameters:
values (pdarray) – The values to group and find median
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_medians (pdarray, float64) – One median value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,9)
>>> a
array([4 1 4 3 2 2 2 3 3])
>>> g = ak.GroupBy(a)
>>> g.keys
array([4 1 4 3 2 2 2 3 3])
>>> b = ak.linspace(-5,5,9)
>>> b
array([-5 -3.75 -2.5 -1.25 0 1.25 2.5 3.75 5])
>>> g.median(b)
(array([1 2 3 4]), array([-3.75 1.25 3.75 -3.75]))
- min(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the minimum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find minima
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_minima (pdarray) – One minimum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if min is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if min is not supported for the values dtype
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.min(b)
(array([2, 3, 4]), array([1, 1, 3]))
- max(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the maximum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find maxima
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_maxima (pdarray) – One maximum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if max is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if max is not supported for the values dtype
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.max(b)
(array([2, 3, 4]), array([4, 4, 3]))
- argmin(values: arkouda.pdarrayclass.pdarray) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first minimum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find argmin
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if argmin is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if argmin is not supported for the values dtype
Notes
The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmin(b)
(array([2, 3, 4]), array([5, 4, 2]))
- argmax(values: arkouda.pdarrayclass.pdarray) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first maximum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find argmax
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmax(b)
(array([2, 3, 4]), array([9, 3, 2]))
- nunique(values: groupable) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the number of unique values in each group.
- Parameters:
values (pdarray, int64) – The values to group and find unique values
- Returns:
unique_keys (groupable) – The unique keys, in grouped order
group_nunique (groupable) – Number of unique values per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the dtype(s) of values array(s) does/do not support the nunique method
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if nunique is not supported for the values dtype
Examples
>>> data = ak.array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> data
array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> labels = ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> labels
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g = ak.GroupBy(labels)
>>> g.keys
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g.nunique(data)
(array([1, 2, 3, 4]), array([2, 2, 3, 1]))
# Group (1,1,1) has values [3,4,3] -> 2 unique values (3 and 4)
# Group (2,2,2) has values [1,1,4] -> 2 unique values (1 and 4)
# Group (3,3,3) has values [3,4,1] -> 3 unique values
# Group (4) has values [4] -> 1 unique value
- any(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and perform an “or” reduction on each group.
- Parameters:
values (pdarray, bool) – The values to group and reduce with “or”
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_any (pdarray, bool) – One bool per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
- all(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and perform an “and” reduction on each group.
- Parameters:
values (pdarray, bool) – The values to group and reduce with “and”
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_all (pdarray, bool) – One bool per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if all is not supported for the values dtype
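Since the any/all reductions above require a running Arkouda server to demonstrate, here is a minimal pure-Python sketch of their semantics; group_any_all is an illustrative helper, not part of the arkouda API.

```python
# Group-wise "or"/"and" reductions: for each unique key, reduce that
# group's bool values with any() / all(), mirroring GroupBy.any / GroupBy.all.
from collections import defaultdict

def group_any_all(keys, values):
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    uk = sorted(groups)  # unique keys, in grouped (sorted) order
    return uk, [any(groups[k]) for k in uk], [all(groups[k]) for k in uk]

keys = [1, 1, 2, 2, 3]
vals = [True, False, False, False, True]
uk, g_any, g_all = group_any_all(keys, vals)
print(uk, g_any, g_all)  # [1, 2, 3] [True, False, True] [False, False, True]
```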
- OR(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise OR of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise OR reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with OR
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise OR of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if OR is not supported for the values dtype
- AND(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise AND of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise AND reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with AND
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise AND of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if AND is not supported for the values dtype
- XOR(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise XOR of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise XOR reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with XOR
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise XOR of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if XOR is not supported for the values dtype
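The bitwise OR/AND/XOR segment reductions above can be sketched in plain Python without a server; group_bitwise below is an illustrative stand-in, not an arkouda function.

```python
# For each unique key, fold that group's integer values with a bitwise
# operator, mirroring GroupBy.OR / GroupBy.AND / GroupBy.XOR.
import operator
from collections import defaultdict
from functools import reduce

def group_bitwise(keys, values, op):
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    uk = sorted(groups)
    return uk, [reduce(op, groups[k]) for k in uk]

keys = [0, 0, 1, 1]
vals = [0b0101, 0b0011, 0b1100, 0b1010]
print(group_bitwise(keys, vals, operator.or_))   # ([0, 1], [7, 14])
print(group_bitwise(keys, vals, operator.and_))  # ([0, 1], [1, 8])
print(group_bitwise(keys, vals, operator.xor))   # ([0, 1], [6, 6])
```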
- first(values: groupable_element_type) Tuple[groupable, groupable_element_type]#
First value in each group.
- Parameters:
values (pdarray-like) – The values from which to take the first of each group
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The first value of each group
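A quick pure-Python sketch of the "first value in each group" semantics; group_first is a hypothetical helper for illustration only, since the arkouda version runs server-side.

```python
# Keep the first value encountered for each key, in value order of
# appearance; return keys in grouped (sorted) order like GroupBy.first.
def group_first(keys, values):
    seen = {}
    for k, v in zip(keys, values):
        seen.setdefault(k, v)  # only the first value per key is kept
    uk = sorted(seen)
    return uk, [seen[k] for k in uk]

print(group_first(['b', 'a', 'b', 'a'], [10, 20, 30, 40]))  # (['a', 'b'], [20, 10])
```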
- mode(values: groupable) Tuple[groupable, groupable]#
Most common value in each group. If a group is multi-modal, return the modal value that occurs first.
- Parameters:
values ((list of) pdarray-like) – The values from which to take the mode of each group
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result ((list of) pdarray-like) – The most common value of each group
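The mode semantics above (most common value, ties broken by first occurrence) can be sketched with collections.Counter; group_mode is an illustrative helper, not arkouda API.

```python
# Most common value per group; Counter preserves insertion order and
# most_common returns equally-common elements in first-seen order (CPython),
# matching the "first modal value wins" tie-break described above.
from collections import Counter, defaultdict

def group_mode(keys, values):
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    uk = sorted(groups)
    return uk, [Counter(groups[k]).most_common(1)[0][0] for k in uk]

print(group_mode([1, 1, 1, 2, 2], ['x', 'y', 'y', 'a', 'b']))  # ([1, 2], ['y', 'a'])
```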
- unique(values: groupable)#
Return the set of unique values in each group, as a SegArray.
- Parameters:
values ((list of) pdarray-like) – The values to unique
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result ((list of) SegArray) – The unique values of each group
- Raises:
TypeError – Raised if values is or contains Strings or Categorical
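A pure-Python sketch of the per-group unique-values semantics (arkouda returns these as a SegArray); group_unique is a hypothetical stand-in for illustration.

```python
# Distinct values per group, kept in first-seen order; a dict is used as an
# ordered set since Python dicts preserve insertion order.
from collections import defaultdict

def group_unique(keys, values):
    groups = defaultdict(dict)
    for k, v in zip(keys, values):
        groups[k][v] = None
    uk = sorted(groups)
    return uk, [list(groups[k]) for k in uk]

print(group_unique([1, 1, 1, 2, 2], [3, 4, 3, 5, 5]))  # ([1, 2], [[3, 4], [5]])
```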
- broadcast(values: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings, permute: bool = True) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings#
Fill each group’s segment with a constant value.
- Parameters:
- Returns:
The broadcasted values
- Return type:
- Raises:
TypeError – Raised if value is not a pdarray object
ValueError – Raised if the values array does not have one value per segment
Notes
This function is a sparse analog of np.broadcast. If a GroupBy object represents a sparse matrix (tensor), then this function takes a (dense) column vector and replicates each value to the non-zero elements in the corresponding row.
Examples
>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
>>> # By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
>>> # With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5])
>>> a = ak.randint(1,5,10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
- static build_from_components(user_defined_name: str = None, **kwargs) GroupBy#
Function to build a new GroupBy object from component keys and permutation.
- Parameters:
user_defined_name (str, optional) – Passing a name will init the new GroupBy and assign it the given name
kwargs (dict) – Dictionary of components required for rebuilding the GroupBy. Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”
- Returns:
The GroupBy object created by using the given components
- Return type:
- register(user_defined_name: str) GroupBy#
Register this GroupBy object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the GroupBy is to be registered under, this will be the root name for underlying components
- Returns:
The same GroupBy which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluid programming style. Please note you cannot register two different GroupBys with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the GroupBy with the user_defined_name
See also
unregister, attach, unregister_groupby_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister()#
Unregister this GroupBy object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() bool#
Return True if the object is contained in the registry
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mismatch of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- static attach(user_defined_name: str) GroupBy#
Function to return a GroupBy object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which GroupBy object was registered under
- Returns:
The GroupBy object created by re-attaching to the corresponding server components
- Return type:
- Raises:
RegistrationError – if user_defined_name is not registered
See also
register,is_registered,unregister,unregister_groupby_by_name
- static unregister_groupby_by_name(user_defined_name: str) None#
Function to unregister GroupBy object by name which was registered with the arkouda server via register()
- Parameters:
user_defined_name (str) – Name under which the GroupBy object was registered
- Raises:
TypeError – if user_defined_name is not a string
RegistrationError – if there is an issue attempting to unregister any underlying components
See also
- most_common(values)#
(Deprecated) See GroupBy.mode().
- arkouda.broadcast(segments: arkouda.pdarrayclass.pdarray, values: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings, size: int | numpy.int64 | numpy.uint64 = -1, permutation: arkouda.pdarrayclass.pdarray | None = None)#
Broadcast a dense column vector to the rows of a sparse matrix or grouped array.
- Parameters:
segments (pdarray, int64) – Offsets of the start of each row in the sparse matrix or grouped array. Must be sorted in ascending order.
values (pdarray, Strings) – The values to broadcast, one per row (or group)
size (int) – The total number of nonzeros in the matrix. If permutation is given, this argument is ignored and the size is inferred from the permutation array.
permutation (pdarray, int64) – The permutation to go from the original ordering of nonzeros to the ordering grouped by row. To broadcast values back to the original ordering, this permutation will be inverted. If no permutation is supplied, it is assumed that the original nonzeros were already grouped by row. In this case, the size argument must be given.
- Returns:
The broadcast values, one per nonzero
- Return type:
- Raises:
ValueError –
If segments and values are different sizes
If segments are empty
If number of nonzeros (either user-specified or inferred from permutation) is less than one
Examples
>>> # Define a sparse matrix with 3 rows and 7 nonzeros
>>> row_starts = ak.array([0, 2, 5])
>>> nnz = 7
>>> # Broadcast the row number to each nonzero element
>>> row_number = ak.arange(3)
>>> ak.broadcast(row_starts, row_number, nnz)
array([0 0 1 1 1 2 2])
>>> # If the original nonzeros were in reverse order...
>>> permutation = ak.arange(6, -1, -1)
>>> ak.broadcast(row_starts, row_number, permutation=permutation)
array([2 2 1 1 1 0 0])
- arkouda.akcast(pda: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical, dt: numpy.dtype | type | str | arkouda.dtypes.BigInt, errors: ErrorMode = ErrorMode.strict) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical | Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Cast an array to another dtype.
- Parameters:
pda (pdarray, Strings, or Categorical) – The array of values to cast
dt (np.dtype, type, or str) – The target dtype to cast values to
errors ({strict, ignore, return_validity}) –
Controls how errors are handled when casting strings to a numeric type (ignored for casts from numeric types).
strict: raise RuntimeError if any string cannot be converted
ignore: never raise an error; uninterpretable strings get converted to NaN (float64), -2**63 (int64), zero (uint64 and uint8), or False (bool)
return_validity: in addition to returning the same output as “ignore”, also return a bool array indicating where the cast was successful.
- Returns:
pdarray or Strings – Array of values cast to desired dtype
validity (pdarray, bool) – If errors=”return_validity” and the input is Strings, a second array is returned with True where the cast succeeded and False where it failed.
Notes
The cast is performed according to Chapel’s casting rules and is NOT safe from overflows or underflows. The user must ensure that the target dtype has the precision and capacity to hold the desired result.
Examples
>>> ak.cast(ak.linspace(1.0,5.0,5), dt=ak.int64) array([1, 2, 3, 4, 5])
>>> ak.cast(ak.arange(0,5), dt=ak.float64).dtype dtype('float64')
>>> ak.cast(ak.arange(0,5), dt=ak.bool) array([False, True, True, True, True])
>>> ak.cast(ak.linspace(0,4,5), dt=ak.bool) array([False, True, True, True, True])
- arkouda.where(condition: arkouda.pdarrayclass.pdarray, A: str | arkouda.dtypes.numeric_scalars | arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical, B: str | arkouda.dtypes.numeric_scalars | arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical#
Returns an array with elements chosen from A and B based upon a conditioning array. As is the case with numpy.where, the return array consists of values from the first array (A) where the conditioning array elements are True and from the second array (B) where the conditioning array elements are False.
- Parameters:
condition (pdarray) – Used to choose values from A or B
A (Union[numeric_scalars, str, pdarray, Strings, Categorical]) – Value(s) used when condition is True
B (Union[numeric_scalars, str, pdarray, Strings, Categorical]) – Value(s) used when condition is False
- Returns:
Values chosen from A where the condition is True and B where the condition is False
- Return type:
- Raises:
TypeError – Raised if the condition object is not a pdarray; if A or B is not an int, np.int64, float, np.float64, str, pdarray, Strings, or Categorical; if pdarray dtypes are not supported or do not match; or if multiple condition clauses (see Notes section) are applied
ValueError – Raised if the shapes of the condition, A, and B pdarrays are unequal
Examples
>>> a1 = ak.arange(1,10)
>>> a2 = ak.ones(9, dtype=np.int64)
>>> cond = a1 < 5
>>> ak.where(cond,a1,a2)
array([1, 2, 3, 4, 1, 1, 1, 1, 1])
>>> a1 = ak.arange(1,10)
>>> a2 = ak.ones(9, dtype=np.int64)
>>> cond = a1 == 5
>>> ak.where(cond,a1,a2)
array([1, 1, 1, 1, 5, 1, 1, 1, 1])
>>> a1 = ak.arange(1,10)
>>> a2 = 10
>>> cond = a1 < 5
>>> ak.where(cond,a1,a2)
array([1, 2, 3, 4, 10, 10, 10, 10, 10])
>>> s1 = ak.array([f'str {i}' for i in range(10)])
>>> s2 = 'str 21'
>>> cond = (ak.arange(10) % 2 == 0)
>>> ak.where(cond,s1,s2)
array(['str 0', 'str 21', 'str 2', 'str 21', 'str 4', 'str 21', 'str 6', 'str 21', 'str 8', 'str 21'])
>>> c1 = ak.Categorical(ak.array([f'str {i}' for i in range(10)]))
>>> c2 = ak.Categorical(ak.array([f'str {i}' for i in range(9, -1, -1)]))
>>> cond = (ak.arange(10) % 2 == 0)
>>> ak.where(cond,c1,c2)
array(['str 0', 'str 8', 'str 2', 'str 6', 'str 4', 'str 4', 'str 6', 'str 2', 'str 8', 'str 0'])
Notes
A and B must have the same dtype. Only one conditional clause is supported; compound conditions (e.g., combining n < 5 and n > 1), which numpy supports, are not currently supported in Arkouda.
- exception arkouda.RegistrationError#
Bases: Exception
Error/Exception used when the Arkouda Server cannot register an object
- class arkouda.pdarray(name: str, mydtype: numpy.dtype | str, size: arkouda.dtypes.int_scalars, ndim: arkouda.dtypes.int_scalars, shape: Sequence[int], itemsize: arkouda.dtypes.int_scalars, max_bits: int | None = None)#
The basic arkouda array class. This class contains only the attributes of the array; the data resides on the arkouda server. When a server operation results in a new array, arkouda will create a pdarray instance that points to the array data on the server. As such, the user should not initialize pdarray instances directly.
- name#
The server-side identifier for the array
- Type:
str
- dtype#
The element type of the array
- Type:
dtype
- size#
The number of elements in the array
- Type:
int_scalars
- ndim#
The rank of the array (currently only rank 1 arrays supported)
- Type:
int_scalars
- shape#
A list or tuple containing the sizes of each dimension of the array
- Type:
Sequence[int]
- itemsize#
The size in bytes of each element
- Type:
int_scalars
- property max_bits#
- BinOps#
- OpEqOps#
- objType = 'pdarray'#
- format_other(other) str#
Attempt to cast scalar other to the element dtype of this pdarray, and return the resulting value as a string (e.g. for sending to a server command). The user should not call this function directly.
- Parameters:
other (object) – The scalar to be cast to the pdarray.dtype
- Return type:
string representation of np.dtype corresponding to the other parameter
- Raises:
TypeError – Raised if the other parameter cannot be converted to Numpy dtype
- transfer(hostname: str, port: arkouda.dtypes.int_scalars)#
Sends a pdarray to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the pdarray is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports in succession, so ports in the range {port..(port+numLocales-1)} will be used (e.g., running an Arkouda server of 4 nodes with port 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- opeq(other, op)#
- fill(value: arkouda.dtypes.numeric_scalars) None#
Fill the array (in place) with a constant value.
- Parameters:
value (numeric_scalars) –
- Raises:
TypeError – Raised if value is not an int, int64, float, or float64
- any() numpy.bool_#
Return True iff any element of the array evaluates to True.
- all() numpy.bool_#
Return True iff all elements of the array evaluate to True.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry
- Parameters:
None –
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
Note
This will return True if the object is registered itself or as a component of another object
- info() str#
Returns a JSON formatted string containing information about all components of self
- Parameters:
None –
- Returns:
JSON string containing information about all components of self
- Return type:
str
- pretty_print_info() None#
Prints information about all components of self in a human readable format
- Parameters:
None –
- Return type:
None
- is_sorted() numpy.bool_#
Return True iff the array is monotonically non-decreasing.
- Parameters:
None –
- Returns:
Indicates if the array is monotonically non-decreasing
- Return type:
bool
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- sum() arkouda.dtypes.numeric_and_bool_scalars#
Return the sum of all elements in the array.
- prod() numpy.float64#
Return the product of all elements in the array. The return value is always a np.float64.
- min() arkouda.dtypes.numpy_scalars#
Return the minimum value of the array.
- max() arkouda.dtypes.numpy_scalars#
Return the maximum value of the array.
- argmin() numpy.int64 | numpy.uint64#
Return the index of the first occurrence of the array min value
- argmax() numpy.int64 | numpy.uint64#
Return the index of the first occurrence of the array max value.
- mean() numpy.float64#
Return the mean of the array.
- var(ddof: arkouda.dtypes.int_scalars = 0) numpy.float64#
Compute the variance. See arkouda.var for details.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
- Returns:
The scalar variance of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
ValueError – Raised if the ddof >= pdarray size
RuntimeError – Raised if there’s a server-side error thrown
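A quick numeric illustration of the ddof parameter: the variance divides by N - ddof, so ddof=0 gives the population (maximum-likelihood) estimate and ddof=1 the unbiased sample estimate. The var helper below is a plain-Python sketch of the formula, not the arkouda implementation.

```python
# Variance with a "Delta Degrees of Freedom" divisor: sum((x - mean)^2) / (N - ddof)
def var(xs, ddof=0):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - ddof)

data = [1.0, 2.0, 3.0, 4.0]  # mean 2.5, sum of squared deviations = 5.0
print(var(data, ddof=0))  # 1.25 (divides by N = 4)
print(var(data, ddof=1))  # 1.6666... (divides by N - 1 = 3)
```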
- std(ddof: arkouda.dtypes.int_scalars = 0) numpy.float64#
Compute the standard deviation. See arkouda.std for details.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
The scalar standard deviation of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- cov(y: pdarray) numpy.float64#
Compute the covariance between self and y.
- Parameters:
y (pdarray) – Other pdarray used to calculate covariance
- Returns:
The scalar covariance of the two arrays
- Return type:
np.float64
- Raises:
TypeError – Raised if y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- corr(y: pdarray) numpy.float64#
Compute the correlation between self and y using pearson correlation coefficient.
- Parameters:
y (pdarray) – Other pdarray used to calculate correlation
- Returns:
The scalar correlation of the two arrays
- Return type:
np.float64
- Raises:
TypeError – Raised if y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
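The cov/corr methods above compute the covariance and Pearson correlation between two equal-length arrays. A plain-Python sketch of the formulas, assuming the conventional N - 1 divisor for covariance (correlation itself is divisor-independent); these helpers are illustrative, not the arkouda implementation.

```python
import math

def cov(x, y, ddof=1):
    # Covariance: mean product of paired deviations, divided by N - ddof
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - ddof)

def corr(x, y):
    # Pearson correlation: covariance normalized by both standard deviations
    return cov(x, y) / (math.sqrt(cov(x, x)) * math.sqrt(cov(y, y)))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # y = 2*x, perfectly linearly related
print(corr(x, y))  # 1.0 (up to floating-point rounding)
```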
- mink(k: arkouda.dtypes.int_scalars) pdarray#
Compute the minimum “k” values.
- Parameters:
k (int_scalars) – The desired count of minimum values to be returned by the output.
- Returns:
The minimum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- maxk(k: arkouda.dtypes.int_scalars) pdarray#
Compute the maximum “k” values.
- Parameters:
k (int_scalars) – The desired count of maximum values to be returned by the output.
- Returns:
The maximum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- argmink(k: arkouda.dtypes.int_scalars) pdarray#
Finds the indices corresponding to the minimum “k” values.
- Parameters:
k (int_scalars) – The desired count of minimum values to be returned by the output.
- Returns:
Indices corresponding to the minimum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- argmaxk(k: arkouda.dtypes.int_scalars) pdarray#
Finds the indices corresponding to the maximum “k” values.
- Parameters:
k (int_scalars) – The desired count of maximum values to be returned by the output.
- Returns:
Indices corresponding to the maximum k values, sorted
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
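The mink/maxk/argmink/argmaxk semantics above can be sketched with sorting in plain Python (the arkouda versions run server-side in parallel); mink and argmaxk below are illustrative helpers, not arkouda API.

```python
# k smallest values, in ascending order
def mink(xs, k):
    return sorted(xs)[:k]

# indices of the k largest values, returned sorted by ascending value
def argmaxk(xs, k):
    idx = sorted(range(len(xs)), key=lambda i: xs[i])
    return idx[-k:]

xs = [7, 1, 9, 4, 3]
print(mink(xs, 2))     # [1, 3]
print(argmaxk(xs, 2))  # [0, 2] -> the values 7 and 9
```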
- value_counts()#
Count the occurrences of the unique values of self.
- Returns:
unique_values (pdarray) – The unique values, sorted in ascending order
counts (pdarray, int64) – The number of times the corresponding unique value occurs
Examples
>>> ak.array([2, 0, 2, 4, 0, 0]).value_counts()
(array([0, 2, 4]), array([3, 2, 1]))
- astype(dtype) pdarray#
Cast values of pdarray to provided dtype
- Parameters:
dtype (np.dtype or str) – Dtype to cast to
- Returns:
An arkouda pdarray with values converted to the specified data type
- Return type:
ak.pdarray
Notes
This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.
- slice_bits(low, high) pdarray#
Returns a pdarray containing only bits from low to high of self.
This is zero indexed and inclusive on both ends, so slicing the bottom 64 bits is pda.slice_bits(0, 63)
- Parameters:
low (int) – The lowest bit included in the slice (inclusive) zero indexed, so the first bit is 0
high (int) – The highest bit included in the slice (inclusive)
- Returns:
A new pdarray containing the bits of self from low to high
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> p = ak.array([2**65 + (2**64 - 1)])
>>> bin(p[0])
'0b101111111111111111111111111111111111111111111111111111111111111111'
>>> bin(p.slice_bits(64, 65)[0]) '0b10'
- bigint_to_uint_arrays() List[pdarray]#
Creates a list of uint pdarrays from a bigint pdarray. The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.
- Returns:
A list of uint pdarrays where: The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.
- Return type:
List[pdarrays]
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> a = ak.arange(2**64, 2**64 + 5)
>>> a
array(["18446744073709551616" "18446744073709551617" "18446744073709551618" "18446744073709551619" "18446744073709551620"])
>>> a.bigint_to_uint_arrays()
[array([1 1 1 1 1]), array([0 1 2 3 4])]
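The limb decomposition for a single bigint value can be sketched in plain Python (highest 64 bits first, matching the list order above; hypothetical helper):

```python
def to_uint64_limbs(value, num_limbs):
    # Split a Python int into num_limbs 64-bit limbs,
    # most significant limb first.
    mask = (1 << 64) - 1
    limbs = []
    for i in reversed(range(num_limbs)):
        limbs.append((value >> (64 * i)) & mask)
    return limbs

print(to_uint64_limbs(2**64 + 3, 2))  # [1, 3]
```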
- reshape(*shape, order='row_major')#
Gives a new shape to an array without changing its data.
- Parameters:
shape (int, tuple of ints, or pdarray) – The new shape should be compatible with the original shape.
order (str {'row_major' | 'C' | 'column_major' | 'F'}) – Read the elements of the pdarray in this index order. By default, read the elements in 'row_major' (C-like) order, where the last index changes the fastest. If 'column_major' or 'F', read the elements in column-major (Fortran-like) order, where the first index changes the fastest.
- Returns:
An ArrayView object with the data from the array but with the new shape
- Return type:
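The two read orders can be illustrated with plain Python index arithmetic for a 2x3 reshape (a sketch, not the server implementation):

```python
def reshape_2d(data, rows, cols, order="row_major"):
    # Build the rows x cols grid by reading `data` in the requested
    # index order: row_major/C varies the last index fastest,
    # column_major/F varies the first index fastest.
    grid = [[None] * cols for _ in range(rows)]
    for k, value in enumerate(data):
        if order in ("row_major", "C"):
            r, c = divmod(k, cols)
        else:  # column_major / F
            c, r = divmod(k, rows)
        grid[r][c] = value
    return grid

print(reshape_2d([0, 1, 2, 3, 4, 5], 2, 3))                  # [[0, 1, 2], [3, 4, 5]]
print(reshape_2d([0, 1, 2, 3, 4, 5], 2, 3, "column_major"))  # [[0, 2, 4], [1, 3, 5]]
```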
- to_ndarray() numpy.ndarray#
Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same attributes and data as the pdarray
- Return type:
np.ndarray
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed
client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
numpy.ndarray
- to_list() List#
Convert the array to a list, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A list with the same data as the pdarray
- Return type:
list
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed
client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_list()
[0, 1, 2, 3, 4]
>>> type(a.to_list())
list
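The client-side size guard described in the Notes can be sketched as follows (hypothetical function and parameter names; the real check lives in the Arkouda client):

```python
def check_transfer(size, itemsize, max_transfer_bytes):
    # Refuse transfers whose total byte count exceeds the limit,
    # protecting client memory from a much larger server-side array.
    nbytes = size * itemsize
    if nbytes > max_transfer_bytes:
        raise RuntimeError(
            f"array too large to transfer: {nbytes} bytes > {max_transfer_bytes}"
        )
    return nbytes

print(check_transfer(5, 8, 1024))  # 40
```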
- to_cuda()#
Convert the array to a Numba DeviceND array, transferring array data from the arkouda server to Python via ndarray. If the array exceeds a builtin size limit, a RuntimeError is raised.
- Returns:
A Numba ndarray with the same attributes and data as the pdarray; on GPU
- Return type:
numba.DeviceNDArray
- Raises:
ImportError – Raised if CUDA is not available
ModuleNotFoundError – Raised if Numba is either not installed or not enabled
RuntimeError – Raised if there is a server-side error thrown in the course of retrieving the pdarray.
Notes
The number of bytes in the array cannot exceed
client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_cuda()
array([0, 1, 2, 3, 4])
>>> type(a.to_cuda())
numpy.devicendarray
- to_parquet(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None) str#
Save the pdarray to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files if they exist. If 'append', attempt to create new dataset in existing files.
compression (str (Optional)) – (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4") Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'.
'append' write mode is supported, but is not efficient.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (.parquet)
>>> a.to_parquet('path/prefix.parquet', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
- to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute') str#
Save the pdarray to HDF5. The object can be saved to a collection of files or a single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files if they exist. If 'append', attempt to create new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: "distribute". When set to single, dataset is written to a single file. When distribute, dataset is written to a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'. Otherwise, the file name will be prefix_path.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')
Saves the array to a single HDF5 file on the root node: ``cwd/path/name_prefix.hdf5``
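The output-file naming described in the Notes can be sketched client-side (hypothetical helper; the zero-padded width of #### is an assumption here, and the server performs the actual naming):

```python
def output_filenames(prefix_path, num_locales, file_type="distribute"):
    # 'single' writes one file named by the prefix alone;
    # 'distribute' writes one file per locale with _LOCALE#### appended.
    if file_type == "single":
        return [prefix_path]
    return [f"{prefix_path}_LOCALE{i:04d}" for i in range(num_locales)]

print(output_filenames("path/prefix", 2))
# ['path/prefix_LOCALE0000', 'path/prefix_LOCALE0001']
```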
- update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)#
Overwrite the dataset with the name provided with this pdarray. If the dataset does not exist, it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True. HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to False will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
- to_csv(prefix_path: str, dataset: str = 'array', col_delim: str = ',', overwrite: bool = False)#
Write pdarray to CSV file(s). File will contain a single column with the pdarray data. All CSV Files written by Arkouda include a header denoting data types of the columns.
- Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
dataset (str) – Column name to save the pdarray under. Defaults to "array".
col_delim (str) – Defaults to ",". Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
str – response message
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true, this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations.
The column delimiter is expected to be the same for column names and data.
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline ("\n") at this time.
- save(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str#
DEPRECATED Save the pdarray to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 supports single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files if they exist. If 'append', attempt to create new dataset in existing files.
compression (str (Optional)) – (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4") Sets the compression type used with Parquet files
file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If 'Parquet', the files will be written to the Parquet file format. This is case insensitive.
file_type (str ("single" | "distribute")) – Default: "distribute". When set to single, dataset is written to a single file. When distribute, dataset is written to a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append
TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string
See also
save_all, load, read, to_parquet, to_hdf
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission. Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales. If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. Previously all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found. Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.save('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.save('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving with an extension (Parquet)
>>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet')
Saves the array in numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
- register(user_defined_name: str) pdarray#
Register this pdarray with a user defined name in the arkouda server so it can be attached to later using pdarray.attach(). This is an in-place operation; registering a pdarray more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one pdarray at a time.
- Parameters:
user_defined_name (str) – user defined name array is to be registered under
- Returns:
The same pdarray which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different pdarrays with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the pdarray with the user_defined_name If the user is attempting to register more than one pdarray with the same name, the former should be unregistered first to free up the registration name.
See also
attach, unregister, is_registered, list_registry, unregister_pdarray_by_name
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
- unregister() None#
Unregister a pdarray in the arkouda server which was previously registered using register() and/or attached to using attach()
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not find the internal name/symbol to remove
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
- static attach(user_defined_name: str) pdarray#
Static method to return a pdarray attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which array was registered under
- Returns:
pdarray which is bound to the corresponding server side component which was registered with user_defined_name
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
- arkouda.arange(*args, **kwargs) arkouda.pdarrayclass.pdarray#
arange([start,] stop[, stride,] dtype=int64)
Create a pdarray of consecutive integers within the interval [start, stop). If only one arg is given then arg is the stop parameter. If two args are given, then the first arg is start and second is stop. If three args are given, then the first arg is start, second is stop, third is stride.
The return value is cast to type dtype
- Parameters:
start (int_scalars, optional) – Starting value (inclusive)
stop (int_scalars) – Stopping value (exclusive)
stride (int_scalars, optional) – The difference between consecutive elements; the default stride is 1. If stride is specified, then start must also be specified.
dtype (np.dtype, type, or str) – The target dtype to cast values to
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
Integers from start (inclusive) to stop (exclusive) by stride
- Return type:
pdarray, dtype
- Raises:
TypeError – Raised if start, stop, or stride is not an int object
ZeroDivisionError – Raised if stride == 0
Notes
Negative strides result in decreasing values. Currently, only int64 pdarrays can be created with this method. For float64 arrays, use the linspace method.
Examples
>>> ak.arange(0, 5, 1)
array([0, 1, 2, 3, 4])
>>> ak.arange(5, 0, -1)
array([5, 4, 3, 2, 1])
>>> ak.arange(0, 10, 2)
array([0, 2, 4, 6, 8])
>>> ak.arange(-5, -10, -1)
array([-5, -6, -7, -8, -9])
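The argument dispatch mirrors Python's built-in range; a minimal sketch (hypothetical helper `arange_args`):

```python
def arange_args(*args):
    # One arg: stop. Two args: start, stop. Three: start, stop, stride.
    if len(args) == 1:
        start, stop, stride = 0, args[0], 1
    elif len(args) == 2:
        start, stop, stride = args[0], args[1], 1
    else:
        start, stop, stride = args
    if stride == 0:
        raise ZeroDivisionError("stride must be nonzero")
    return list(range(start, stop, stride))

print(arange_args(5))         # [0, 1, 2, 3, 4]
print(arange_args(5, 0, -1))  # [5, 4, 3, 2, 1]
```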
- arkouda.array(a: arkouda.pdarrayclass.pdarray | numpy.ndarray | Iterable, dtype: numpy.dtype | type | str = None, max_bits: int = -1) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings#
Convert a Python or Numpy Iterable to a pdarray or Strings object, sending the corresponding data to the arkouda server.
- Parameters:
a (Union[pdarray, np.ndarray]) – Rank-1 array of a supported dtype
dtype (np.dtype, type, or str) – The target dtype to cast values to
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
A pdarray instance stored on arkouda server or Strings instance, which is composed of two pdarrays stored on arkouda server
- Return type:
- Raises:
TypeError – Raised if a is not a pdarray, np.ndarray, or Python Iterable such as a list, array, tuple, or deque
RuntimeError – Raised if a is not one-dimensional, nbytes > maxTransferBytes, a.dtype is not supported (not in DTypes), or if the product of a size and a.itemsize > maxTransferBytes
ValueError – Raised if the returned message is malformed or does not contain the fields required to generate the array.
See also
Notes
The number of bytes in the input array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overwhelming the connection between the Python client and the arkouda server, under the assumption that it is a low-bandwidth connection. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but should proceed with caution.
If the pdarray or ndarray is of type U, this method is called twice recursively to create the Strings object and the two corresponding pdarrays for string bytes and offsets, respectively.
Examples
>>> ak.array(np.arange(1,10))
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> ak.array(range(1,10))
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> strings = ak.array([f'string {i}' for i in range(0,5)])
>>> type(strings)
<class 'arkouda.strings.Strings'>
- arkouda.create_pdarray(repMsg: str, max_bits=None) pdarray#
Return a pdarray instance pointing to an array created by the arkouda server. The user should not call this function directly.
- Parameters:
repMsg (str) – space-delimited string containing the pdarray name, datatype, size, dimension, shape, and itemsize
- Returns:
A pdarray instance pointing to the server-side array described by repMsg
- Return type:
- Raises:
ValueError – If there’s an error in parsing the repMsg parameter into the six values needed to create the pdarray instance
RuntimeError – Raised if a server-side error is thrown in the process of creating the pdarray instance
- arkouda.zeros(size: arkouda.dtypes.int_scalars | str, dtype: numpy.dtype | type | str | arkouda.dtypes.BigInt = float64, max_bits: int | None = None) arkouda.pdarrayclass.pdarray#
Create a pdarray filled with zeros.
- Parameters:
size (int_scalars) – Size of the array (only rank-1 arrays supported)
dtype (all_scalars) – Type of resulting array, default float64
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
Zeros of the requested size and dtype
- Return type:
- Raises:
TypeError – Raised if the supplied dtype is not supported or if the size parameter is neither an int nor a str that is parseable to an int.
See also
Examples
>>> ak.zeros(5, dtype=ak.int64)
array([0, 0, 0, 0, 0])
>>> ak.zeros(5, dtype=ak.float64)
array([0, 0, 0, 0, 0])
>>> ak.zeros(5, dtype=ak.bool)
array([False, False, False, False, False])
- class arkouda.Strings(strings_pdarray: arkouda.pdarrayclass.pdarray, bytes_size: arkouda.dtypes.int_scalars)#
Represents an array of strings whose data resides on the arkouda server. The user should not call this class directly; rather its instances are created by other arkouda functions.
- entry#
Encapsulation of a Segmented Strings array contained on the arkouda server. This is a composite of
offsets array: starting indices for each string
bytes array: raw bytes of all strings joined by nulls
- Type:
- size#
The number of strings in the array
- Type:
int_scalars
- nbytes#
The total number of bytes in all strings
- Type:
int_scalars
- ndim#
The rank of the array (currently only rank 1 arrays supported)
- Type:
int_scalars
- shape#
The sizes of each dimension of the array
- Type:
tuple
- dtype#
The dtype is ak.str
- Type:
dtype
- logger#
Used for all logging operations
- Type:
ArkoudaLogger
Notes
Strings is composed of two pdarrays: (1) offsets, which contains the starting indices for each string and (2) bytes, which contains the raw bytes of all strings, delimited by nulls.
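The two-pdarray layout described in these Notes can be sketched client-side (a minimal model, not the server implementation):

```python
def segmented_strings(strings):
    # offsets[i] is the starting byte index of string i; the bytes
    # array is all strings' UTF-8 bytes joined by null terminators.
    offsets = []
    raw = bytearray()
    for s in strings:
        offsets.append(len(raw))
        raw.extend(s.encode("utf-8"))
        raw.append(0)  # null delimiter between strings
    return offsets, bytes(raw)

offsets, raw = segmented_strings(["one", "two", "three"])
print(offsets)   # [0, 4, 8]
print(len(raw))  # 14 bytes: 3+1 + 3+1 + 5+1
```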
- BinOps#
- objType = 'Strings'#
- static from_return_msg(rep_msg: str) Strings#
Factory method for creating a Strings object from an Arkouda server response message
- Parameters:
rep_msg (str) – Server response message currently of form created name type size ndim shape itemsize+created bytes.size 1234
- Returns:
object representing a segmented strings array on the server
- Return type:
- Raises:
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
We really don’t have an itemsize because these are variable length strings. In the future we could probably use this position to store the total bytes.
- static from_parts(offset_attrib: arkouda.pdarrayclass.pdarray | str, bytes_attrib: arkouda.pdarrayclass.pdarray | str) Strings#
Factory method for creating a Strings object from an Arkouda server response where the arrays are separate components.
- Parameters:
- Returns:
object representing a segmented strings array on the server
- Return type:
- Raises:
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
This factory method is used when we construct the parts of a Strings object on the client side and transfer the offsets & bytes separately to the server. This results in two entries in the symbol table and we need to instruct the server to assemble them into a composite entity.
- get_lengths() arkouda.pdarrayclass.pdarray#
Return the length of each string in the array.
- Returns:
The length of each string
- Return type:
pdarray, int
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- get_bytes()#
Getter for the bytes component (uint8 pdarray) of this Strings.
- Returns:
Pdarray of bytes of the string accessed
- Return type:
pdarray, uint8
Example
>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_bytes()
[111 110 101 0 116 119 111 0 116 104 114 101 101 0]
- get_offsets()#
Getter for the offsets component (int64 pdarray) of this Strings.
- Returns:
Pdarray of offsets of the string accessed
- Return type:
pdarray, int64
Example
>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_offsets()
[0 4 8]
- encode(toEncoding: str, fromEncoding: str = 'UTF-8')#
Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding
- Parameters:
toEncoding (str) – The encoding that the strings will be converted to
fromEncoding (str) – The current encoding of the strings object, default to UTF-8
- Returns:
A new Strings object in toEncoding
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- decode(fromEncoding, toEncoding='UTF-8')#
Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding
- Parameters:
fromEncoding (str) – The current encoding of the strings object
toEncoding (str) – The encoding that the strings will be converted to, default to UTF-8
- Returns:
A new Strings object in toEncoding
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- to_lower() Strings#
Returns a new Strings with all uppercase characters from the original replaced with their lowercase equivalent
- Returns:
Strings with all uppercase characters from the original replaced with their lowercase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.to_lower()
array(['strings 0', 'strings 1', 'strings 2', 'strings 3', 'strings 4'])
- to_upper() Strings#
Returns a new Strings with all lowercase characters from the original replaced with their uppercase equivalent
- Returns:
Strings with all lowercase characters from the original replaced with their uppercase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.to_upper()
array(['STRINGS 0', 'STRINGS 1', 'STRINGS 2', 'STRINGS 3', 'STRINGS 4'])
- to_title() Strings#
Returns a new Strings with all strings from the original converted to their titlecase equivalent
- Returns:
Strings with all strings from the original converted to their titlecase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Strings.to_lower, Strings.to_upper
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.to_title()
array(['Strings 0', 'Strings 1', 'Strings 2', 'Strings 3', 'Strings 4'])
- is_lower() arkouda.pdarrayclass.pdarray#
Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely lowercase
- Returns:
True for elements that are entirely lowercase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.is_lower()
array([True True True False False False])
- is_upper() arkouda.pdarrayclass.pdarray#
Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely uppercase
- Returns:
True for elements that are entirely uppercase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.is_upper()
array([False False False True True True])
- is_title() arkouda.pdarrayclass.pdarray#
Returns a boolean pdarray where index i indicates whether string i of the Strings is titlecase
- Returns:
True for elements that are titlecase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> mixed = ak.array([f'sTrINgs {i}' for i in range(3)])
>>> title = ak.array([f'Strings {i}' for i in range(3)])
>>> strings = ak.concatenate([mixed, title])
>>> strings
array(['sTrINgs 0', 'sTrINgs 1', 'sTrINgs 2', 'Strings 0', 'Strings 1', 'Strings 2'])
>>> strings.is_title()
array([False False False True True True])
- strip(chars: bytes | arkouda.dtypes.str_scalars | None = '') Strings#
Returns a new Strings object with all leading and trailing occurrences of characters contained in chars removed. The chars argument is a string specifying the set of characters to be removed. If omitted, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.
- Parameters:
chars – the set of characters to be removed
- Returns:
Strings object with the leading and trailing characters matching the set of characters in the chars argument removed
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['Strings ', ' StringS ', 'StringS '])
>>> s = strings.strip()
>>> s
array(['Strings', 'StringS', 'StringS'])
>>> strings = ak.array(['Strings 1', '1 StringS ', ' 1StringS 12 '])
>>> s = strings.strip(' 12')
>>> s
array(['Strings', 'StringS', 'StringS'])
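Element-wise, this matches Python's str.strip, which also treats chars as a set rather than a prefix or suffix:

```python
# chars is treated as a set: any leading/trailing run of characters
# drawn from {' ', '1', '2'} is removed, mirroring the second example above.
strings = ['Strings 1', '1 StringS ', ' 1StringS 12 ']
stripped = [s.strip(' 12') for s in strings]
print(stripped)  # ['Strings', 'StringS', 'StringS']
```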
- cached_regex_patterns() List#
Returns the regex patterns for which Match objects have been cached
- purge_cached_regex_patterns() None#
purges cached regex patterns
- find_locations(pattern: bytes | arkouda.dtypes.str_scalars) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Finds pattern matches and returns pdarrays containing the number, start positions, and lengths of matches
- Parameters:
pattern (str_scalars) – The regex pattern used to find matches
- Returns:
pdarray, int64 – For each original string, the number of pattern matches
pdarray, int64 – The start positons of pattern matches
pdarray, int64 – The lengths of pattern matches
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> num_matches, starts, lens = strings.find_locations('\d')
>>> num_matches
array([2, 2, 2, 2, 2])
>>> starts
array([0, 9, 0, 9, 0, 9, 0, 9, 0, 9])
>>> lens
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
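Per string, the results can be sketched with re.finditer (the server computes these across the whole array in parallel):

```python
import re

def find_locations(strings, pattern):
    # For each string, record the number of matches, then flatten
    # the start positions and lengths of every match across all strings.
    num_matches, starts, lens = [], [], []
    for s in strings:
        matches = list(re.finditer(pattern, s))
        num_matches.append(len(matches))
        for m in matches:
            starts.append(m.start())
            lens.append(m.end() - m.start())
    return num_matches, starts, lens

strings = [f'{i} string {i}' for i in range(1, 6)]
print(find_locations(strings, r'\d'))
# ([2, 2, 2, 2, 2], [0, 9, 0, 9, 0, 9, 0, 9, 0, 9], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
```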
- search(pattern: bytes | arkouda.dtypes.str_scalars) arkouda.match.Match#
Returns a match object with the first location in each element where pattern produces a match. Elements match if any part of the string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match if any part of the string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+')
<ak.Match object: matched=True, span=(1, 2); matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
- match(pattern: bytes | arkouda.dtypes.str_scalars) arkouda.match.Match#
Returns a match object where elements match only if the beginning of the string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match only if the beginning of the string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.match('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
- fullmatch(pattern: bytes | arkouda.dtypes.str_scalars) arkouda.match.Match#
Returns a match object where elements match only if the whole string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match only if the whole string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.fullmatch('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=False; matched=False>
- split(pattern: bytes | arkouda.dtypes.str_scalars, maxsplit: int = 0, return_segments: bool = False) Strings | Tuple#
Returns a new Strings split by the occurrences of pattern. If maxsplit is nonzero, at most maxsplit splits occur
- Parameters:
pattern (str) – Regex used to split strings into substrings
maxsplit (int) – The max number of pattern match occurrences in each element to split. The default maxsplit=0 splits on all occurrences
return_segments (bool) – If True, return mapping of original strings to first substring in return array.
- Returns:
Strings – Substrings with pattern matches removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.split('_+', maxsplit=2, return_segments=True)
(array(['1', '2', '', '', '', '3', '', '4', '5____6___7', '']), array([0 3 5 6 9]))
- findall(pattern: bytes | arkouda.dtypes.str_scalars, return_match_origins: bool = False) Strings | Tuple#
Return a new Strings containing all non-overlapping matches of pattern
- Parameters:
pattern (str_scalars) – Regex used to find matches
return_match_origins (bool) – If True, return a pdarray containing the index of the original string each pattern match is from
- Returns:
Strings – Strings object containing only pattern matches
pdarray, int64 (optional) – The index of the original string each pattern match is from
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.findall('_+', return_match_origins=True)
(array(['_', '___', '____', '__', '___', '____', '___']), array([0 0 1 3 3 3 3]))
- sub(pattern: bytes | arkouda.dtypes.str_scalars, repl: bytes | arkouda.dtypes.str_scalars, count: int = 0) Strings#
Return new Strings obtained by replacing non-overlapping occurrences of pattern with the replacement repl. If count is nonzero, at most count substitutions occur
- Parameters:
pattern (str_scalars) – The regex to substitute
repl (str_scalars) – The substring to replace pattern matches with
count (int) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Returns:
Strings with pattern matches replaced
- Return type:
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.sub(pattern='_+', repl='-', count=2)
array(['1-2-', '-', '3', '-4-5____6___7', ''])
- subn(pattern: bytes | arkouda.dtypes.str_scalars, repl: bytes | arkouda.dtypes.str_scalars, count: int = 0) Tuple#
Perform the same operation as sub(), but return a tuple (new_Strings, number_of_substitutions)
- Parameters:
pattern (str_scalars) – The regex to substitute
repl (str_scalars) – The substring to replace pattern matches with
count (int) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Returns:
Strings – Strings with pattern matches replaced
pdarray, int64 – The number of substitutions made for each element of Strings
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.subn(pattern='_+', repl='-', count=2)
(array(['1-2-', '-', '3', '-4-5____6___7', '']), array([2 1 0 2 0]))
- contains(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element contains the given substring.
- Parameters:
substr (str_scalars) – The substring in the form of string or byte array to search for
regex (bool) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that contain substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> strings
array(['1 string 1', '2 string 2', '3 string 3', '4 string 4', '5 string 5'])
>>> strings.contains('string')
array([True, True, True, True, True])
>>> strings.contains('string \d', regex=True)
array([True, True, True, True, True])
- startswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element starts with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The prefix to search for
regex (bool) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that start with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.startswith('string')
array([True, True, True, True, True])
>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.startswith('\d str', regex = True)
array([True, True, True, True, True])
- endswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element ends with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The suffix to search for
regex (bool) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that end with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.endswith('ing')
array([True, True, True, True, True])
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.endswith('ing \d', regex = True)
array([True, True, True, True, True])
- flatten(delimiter: str, return_segments: bool = False, regex: bool = False) Strings | Tuple#
Unpack delimiter-joined substrings into a flat array.
- Parameters:
delimiter (str) – Characters used to split strings into substrings
return_segments (bool) – If True, also return mapping of original strings to first substring in return array.
regex (bool) – Indicates whether delimiter is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
Strings – Flattened substrings with delimiters removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array
Examples
>>> orig = ak.array(['one|two', 'three|four|five', 'six'])
>>> orig.flatten('|')
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> flat, map = orig.flatten('|', return_segments=True)
>>> map
array([0, 2, 5])
>>> under = ak.array(['one_two', 'three_____four____five', 'six'])
>>> under_flat, under_map = under.flatten('_+', return_segments=True, regex=True)
>>> under_flat
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> under_map
array([0, 2, 5])
- peel(delimiter: bytes | arkouda.dtypes.str_scalars, times: arkouda.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, fromRight: bool = False, regex: bool = False) Tuple#
Peel off one or more delimited fields from each string (similar to string.partition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the first (times-1) delimiters
includeDelimiter (bool) – If true, append the delimiter to the end of the first return array. By default, it is prepended to the beginning of the second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the first array. By default, such strings are returned in the second array.
fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)
regex (bool) – Indicates whether delimiter is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
- left: Strings
The field(s) peeled from the end of each string (unless fromRight is true)
- right: Strings
The remainder of each string after peeling (unless fromRight is true)
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars, if times is not int64, or if includeDelimiter, keepPartial, or fromRight is not bool
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
>>> s.peel('.', includeDelimiter=True)
(array(['a.', 'c.', 'e.']), array(['b', 'd', 'f.g']))
>>> s.peel('.', times=2)
(array(['', '', 'e.f']), array(['a.b', 'c.d', 'g']))
>>> s.peel('.', times=2, keepPartial=True)
(array(['a.b', 'c.d', 'e.f']), array(['', '', 'g']))
- rpeel(delimiter: bytes | arkouda.dtypes.str_scalars, times: arkouda.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, regex: bool = False)#
Peel off one or more delimited fields from the end of each string (similar to string.rpartition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the last (times-1) delimiters
includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return array. By default, it is appended to the end of the second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the second array. By default, such strings are returned in the first array.
regex (bool) – Indicates whether delimiter is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
- left: Strings
The remainder of the string after peeling
- right: Strings
The field(s) that were peeled from the right of each string
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if times is not int64
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.rpeel('.')
(array(['a', 'c', 'e.f']), array(['b', 'd', 'g']))
# Compared against peel
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
- stick(other: Strings, delimiter: bytes | arkouda.dtypes.str_scalars = '', toLeft: bool = False) Strings#
Join the strings from another array onto one end of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (str) – String inserted between self and other
toLeft (bool) – If true, join other strings to the left of self. By default, other is joined to the right of self.
- Returns:
The array of joined strings
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if the other parameter is not a Strings instance
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.stick(t, delimiter='.')
array(['a.b', 'c.d', 'e.f'])
- lstick(other: Strings, delimiter: bytes | arkouda.dtypes.str_scalars = '') Strings#
Join the strings from another array onto the left of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (Union[bytes,str_scalars]) – String inserted between self and other
- Returns:
The array of joined strings, as other + self
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is neither bytes nor a str or if the other parameter is not a Strings instance
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.lstick(t, delimiter='.')
array(['b.a', 'd.c', 'f.e'])
- get_prefixes(n: arkouda.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray]#
Return the n-long prefix of each string, where possible
- Parameters:
n (int) – Length of prefix
return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-prefix
proper (bool) – If True, only return proper prefixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a prefix.
- Returns:
prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character prefix, False otherwise.
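The prefix/mask relationship described above can be sketched in plain Python. This is a hypothetical stand-in (`get_prefixes_sketch` is not part of arkouda, whose implementation runs server-side on distributed data), intended only to illustrate the semantics of n, proper, and the returned origin mask:

```python
# Hypothetical plain-Python sketch of get_prefixes semantics; not the
# arkouda implementation.
def get_prefixes_sketch(strings, n, proper=True):
    # With proper=True a string must be at least n+1 characters long
    # to contribute a (proper) prefix.
    min_len = n + 1 if proper else n
    mask = [len(s) >= min_len for s in strings]
    prefixes = [s[:n] for s, keep in zip(strings, mask) if keep]
    return prefixes, mask

prefixes, mask = get_prefixes_sketch(['apple', 'ab', 'banana'], 2)
# prefixes == ['ap', 'ba'], mask == [True, False, True]
```

Note how 'ab' is excluded under proper=True: it is exactly 2 characters long, so its only 2-prefix would be the whole string.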
- get_suffixes(n: arkouda.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray]#
Return the n-long suffix of each string, where possible
- Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-suffix
proper (bool) – If True, only return proper suffixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a suffix.
- Returns:
suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character suffix, False otherwise.
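As with get_prefixes, the behavior can be illustrated with a hypothetical plain-Python sketch (not the arkouda implementation). Here proper=False is shown, where a string of exactly n characters is returned whole as its own suffix:

```python
# Hypothetical plain-Python sketch of get_suffixes semantics; not the
# arkouda implementation.
def get_suffixes_sketch(strings, n, proper=True):
    min_len = n + 1 if proper else n
    mask = [len(s) >= min_len for s in strings]
    suffixes = [s[-n:] for s, keep in zip(strings, mask) if keep]
    return suffixes, mask

suffixes, mask = get_suffixes_sketch(['apple', 'ab', 'banana'], 2, proper=False)
# suffixes == ['le', 'ab', 'na'], mask == [True, True, True]
```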
- hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Compute a 128-bit hash of each string.
- Returns:
A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.
- Return type:
Notes
The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.
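The "negligible" claim in the Notes follows from the birthday bound: for n items and a b-bit hash, P(collision) <= n(n-1)/2 * 2^-b. A quick arithmetic check, assuming the 128-bit hash values behave as uniform random values:

```python
# Birthday-bound check of the collision claim in the Notes, assuming
# uniformly distributed 128-bit hash values.
n = 10**15                          # "realistic" upper bound on string count
p_upper = n * (n - 1) / 2 / 2**128  # union bound over all pairs
print(f"P(collision) <= {p_upper:.2e}")  # on the order of 1e-9
```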
- group() arkouda.pdarrayclass.pdarray#
Return the permutation that groups the array, placing equivalent strings together. All instances of the same string are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.
- Returns:
The permutation that groups the array by value
- Return type:
Notes
If the arkouda server is compiled with “-sSegmentedString.useHash=true”, then arkouda uses 128-bit hash values to group strings, rather than sorting the strings directly. This method is fast, but the resulting permutation merely groups equivalent strings and does not sort them. If the “useHash” parameter is false, then a full sort is performed.
- Raises:
RuntimeError – Raised if there is a server-side error in executing group request or creating the pdarray encapsulating the return message
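The contract (equal values land in contiguous blocks, but block order is unspecified) can be illustrated in plain Python. Here the permutation comes from a sort, which is one valid way to satisfy the contract; the server may instead group by hash, so only adjacency of equal values is guaranteed:

```python
# Plain-Python illustration of the group() contract: a permutation that
# places equal values in contiguous blocks. Sorting produces one such
# permutation; hashing (as the server may use) produces another.
strings = ['b', 'a', 'b', 'c', 'a']
perm = sorted(range(len(strings)), key=lambda i: strings[i])
grouped = [strings[i] for i in perm]
# equal strings are now adjacent: ['a', 'a', 'b', 'b', 'c']
```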
- to_ndarray() numpy.ndarray#
Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same strings as this array
- Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
numpy.ndarray
- to_list() list#
Convert the SegString to a list, transferring data from the arkouda server to Python. If the SegString exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A list with the same strings as this SegString
- Return type:
list
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_list()
['hello', 'my', 'world']
>>> type(a.to_list())
list
- astype(dtype) arkouda.pdarrayclass.pdarray#
Cast values of Strings object to provided dtype
- Parameters:
dtype (np.dtype or str) – Dtype to cast to
- Returns:
An arkouda pdarray with values converted to the specified data type
- Return type:
ak.pdarray
Notes
This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.
- to_parquet(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', compression: str | None = None) str#
Save the Strings object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If 'append', attempt to create a new dataset in existing files.
compression (str (Optional)) – (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4") Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'.
'append' write mode is supported, but is not efficient.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- to_hdf(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, file_type: str = 'distribute') str#
Save the Strings object to HDF5. The object can be saved to a collection of files or single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – The name of the Strings dataset to be written, defaults to strings_array
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read.
file_type (str ("single" | "distribute")) – Default: 'distribute'. 'distribute' writes the dataset to one file per locale; 'single' saves the dataset to a single file.
- Return type:
String message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
Parquet files do not store the segments, only the values.
Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string.
The hdf5 group is named via the dataset parameter.
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'. Otherwise, the file name will be prefix_path.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
See also
- update_hdf(prefix_path: str, dataset: str = 'strings_array', save_offsets: bool = True, repack: bool = True)#
Overwrite the dataset with the name provided with this Strings object. If the dataset does not exist, it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read.
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the Strings object
Notes
If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
- to_csv(prefix_path: str, dataset: str = 'strings_array', col_delim: str = ',', overwrite: bool = False)#
Write Strings to CSV file(s). File will contain a single column with the Strings data. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).
- Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
dataset (str) – Column name to save the Strings under. Defaults to “strings_array”.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
str response message
- Raises:
ValueError – Raised if all datasets are not present in all files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
- save(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str#
DEPRECATED Save the Strings object to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 supports single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – The name of the Strings dataset to be written, defaults to strings_array
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If 'append', create a new Strings dataset within existing files.
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read. This is not supported for Parquet files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
file_format (str) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.
file_type (str ("single" | "distribute")) – Default: 'distribute'. 'distribute' writes the dataset to one file per locale; 'single' saves the dataset to a single file. This is only supported for HDF5.
- Return type:
String message indicating result of save operation
Notes
Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter. (3) Parquet files do not store the segments, only the values.
- info() str#
Returns a JSON formatted string containing information about all components of self
- Parameters:
None –
- Returns:
JSON string containing information about all components of self
- Return type:
str
- pretty_print_info() None#
Prints information about all components of self in a human readable format
- Parameters:
None –
- Return type:
None
- register(user_defined_name: str) Strings#
Register this Strings object with a user defined name in the arkouda server so it can be attached to later using Strings.attach(). This is an in-place operation; registering a Strings object more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one object at a time.
- Parameters:
user_defined_name (str) – user defined name which the Strings object is to be registered under
- Returns:
The same Strings object which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different objects with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Strings object with the user_defined_name If the user is attempting to register more than one object with the same name, the former should be unregistered first to free up the registration name.
See also
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- unregister() None#
Unregister a Strings object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not find the internal name/symbol to remove
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry
- Parameters:
None –
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
- static attach(user_defined_name: str) Strings#
class method to return a Strings object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which the Strings object was registered under
- Returns:
the Strings object registered with user_defined_name in the arkouda server
- Return type:
Strings object
- Raises:
TypeError – Raised if user_defined_name is not a str
See also
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- static unregister_strings_by_name(user_defined_name: str) None#
Unregister a Strings object in the arkouda server previously registered via register()
- Parameters:
user_defined_name (str) – The registered name of the Strings object
See also
- transfer(hostname: str, port: arkouda.dtypes.int_scalars)#
Sends a Strings object to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the Strings object is running.
port (int_scalars) – The port over which to send the array. This needs to be an open port (i.e., not one that the Arkouda server is running on). Sending opens numLocales ports in succession, using ports in the range {port..(port+numLocales-1)} (e.g., running an Arkouda server of 4 nodes with port 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- arkouda.BitVectorizer(width=64, reverse=False)#
Make a callback (i.e. function) that can be called on an array to create a BitVector.
- Parameters:
width (int) – The number of bit fields in the vector
reverse (bool) – If True, display bits from least significant (left) to most significant (right). By default, the most significant bit is the left-most bit.
- Returns:
bitvectorizer – A function that takes an array and returns a BitVector instance
- Return type:
callable
- class arkouda.BitVector(values, width=64, reverse=False)#
Bases:
arkouda.pdarrayclass.pdarray
Represent integers as bit vectors, e.g. a set of flags.
- Parameters:
values (pdarray, int64) – The integers to represent as bit vectors
width (int) – The number of bit fields in the vector
reverse (bool) – If True, display bits from least significant (left) to most significant (right). By default, the most significant bit is the left-most bit.
- Returns:
bitvectors – The array of binary vectors
- Return type:
Notes
This class is a thin wrapper around pdarray that mostly affects how values are displayed to the user. Operators and methods will typically treat this class like a uint64 pdarray.
- conserves#
- special_objType = 'BitVector'#
- format(x)#
Format a single binary vector as a string.
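The display semantics can be sketched in pure Python (a hypothetical helper for illustration, not Arkouda's actual implementation): render the low width bits of an integer, most significant bit first, optionally reversed so the least significant bit prints on the left.

```python
def format_bits(x, width=64, reverse=False):
    # Render the low `width` bits of x as a string, most significant bit first.
    bits = format(x & ((1 << width) - 1), "0{}b".format(width))
    # With reverse=True, print from least significant (left) to most significant (right).
    return bits[::-1] if reverse else bits

print(format_bits(5, width=4))                # 0101
print(format_bits(5, width=4, reverse=True))  # 1010
```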
- to_ndarray()#
Export data to a numpy array of string-formatted bit vectors.
- to_list()#
Export data to a list of string-formatted bit vectors.
- opeq(other, op)#
- register(user_defined_name)#
Register this BitVector object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the BitVector is to be registered under, this will be the root name for underlying components
- Returns:
The same BitVector which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different BitVectors with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the BitVector with the user_defined_name
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- classmethod from_return_msg(rep_msg)#
- class arkouda.Fields(values, names, MSB_left=True, pad='-', separator='', show_int=True)#
Bases:
BitVector
An integer-backed representation of a set of named binary fields, e.g. flags.
- Parameters:
values (pdarray or Strings) – The array of field values. If (u)int64, the values are used as-is for the binary representation of fields. If Strings, the values are converted to binary according to the mapping defined by the names and MSB_left arguments.
names (str or sequence of str) – The names of the fields, in order. A string will be treated as a list of single-character field names. Multi-character field names are allowed, but must be passed as a list or tuple and user must specify a separator.
MSB_left (bool) – Controls how field names are mapped to binary values. If True (default), the left-most field name corresponds to the most significant bit in the binary representation. If False, the left-most field name corresponds to the least significant bit.
pad (str) – Character to display when field is not present. Use empty string if no padding is desired.
separator (str) – Substring that separates fields. Used to parse input values (if ak.Strings) and to display output.
show_int (bool) – If True (default), display the integer value of the binary fields in output.
- Returns:
fields – The array of field values
- Return type:
Notes
This class is a thin wrapper around pdarray that mostly affects how values are displayed to the user. Operators and methods will typically treat this class like an int64 pdarray.
- format(x)#
Format a single binary value as a string of named fields.
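The field display described by the names, MSB_left, pad, and separator parameters can be sketched in pure Python (a hypothetical helper illustrating the semantics, not Arkouda's implementation): print a field's name if its bit is set, else the pad character.

```python
def format_fields(x, names, MSB_left=True, pad="-", separator=""):
    # Show each field name whose bit is set in x; otherwise show the pad character.
    n = len(names)
    out = []
    for pos, name in enumerate(names):
        # With MSB_left=True the left-most name maps to the most significant bit.
        bit = (n - 1 - pos) if MSB_left else pos
        out.append(name if (x >> bit) & 1 else pad * len(name))
    return separator.join(out)

print(format_fields(0b101, "ABC"))                  # A-C
print(format_fields(0b100, "ABC", MSB_left=False))  # --C
```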
- opeq(other, op)#
- arkouda.ip_address(values)#
Convert values to an Arkouda array of IP addresses.
- Parameters:
values (list-like, integer pdarray, or IPv4) – The integer IP addresses or IPv4 object.
- Returns:
The same IP addresses as an Arkouda array
- Return type:
Notes
This helper is intended to help future-proof changes made to accommodate IPv6 and to prevent errors if a user inadvertently casts an IPv4 instead of an int64 pdarray. It can also be used for importing Python lists of IP addresses into Arkouda.
- class arkouda.IPv4(values)#
Bases:
arkouda.pdarrayclass.pdarray
Represent integers as IPv4 addresses.
- Parameters:
values (pdarray, int64) – The integer IP addresses
- Returns:
The same IP addresses
- Return type:
Notes
This class is a thin wrapper around pdarray that mostly affects how values are displayed to the user. Operators and methods will typically treat this class like an int64 pdarray.
- special_objType = 'IPv4'#
- export_uint()#
- format(x)#
Format a single integer IP address as a string.
- normalize(x)#
Take in an IP address as a string, integer, or IPAddress object, and convert it to an integer.
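The normalization step can be illustrated with the Python standard library's ipaddress module (a hedged sketch of the semantics, not the method's actual implementation): any accepted input form reduces to the underlying 32-bit integer.

```python
import ipaddress

def normalize_ip(x):
    # Accept a dotted-quad string, an int, or an ipaddress.IPv4Address
    # and return the underlying 32-bit integer.
    if isinstance(x, int):
        return x
    return int(ipaddress.IPv4Address(x))

print(normalize_ip("1.2.3.4"))  # 16909060
```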
- to_ndarray()#
Export array as a numpy array of integers.
- to_list()#
Export array as a list of integers.
- opeq(other, op)#
- register(user_defined_name)#
Register this IPv4 object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the IPv4 is to be registered under, this will be the root name for underlying components
- Returns:
The same IPv4 which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different IPv4s with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the IPv4 with the user_defined_name
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute')#
Override of the pdarray to_hdf to store the special object type
- update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)#
Override the pdarray implementation so that the special object type will be used.
- arkouda.is_ipv4(ip: arkouda.pdarrayclass.pdarray | IPv4, ip2: arkouda.pdarrayclass.pdarray | None = None) arkouda.pdarrayclass.pdarray#
Indicate which values are ipv4 when passed data containing IPv4 and IPv6 values.
- Parameters:
- Return type:
pdarray of bools indicating which indexes are IPv4.
See also
ak.is_ipv6
- arkouda.is_ipv6(ip: arkouda.pdarrayclass.pdarray | IPv4, ip2: arkouda.pdarrayclass.pdarray | None = None) arkouda.pdarrayclass.pdarray#
Indicate which values are ipv6 when passed data containing IPv4 and IPv6 values.
- Parameters:
- Return type:
pdarray of bools indicating which indexes are IPv6.
See also
ak.is_ipv4
- arkouda.DTypes#
- arkouda.DTypeObjects#
- arkouda.dtype(x)#
- arkouda.bool#
- arkouda.int64#
- arkouda.float64#
- arkouda.uint8#
- arkouda.uint64#
- arkouda.str_#
- arkouda.bigint#
- arkouda.intTypes#
- arkouda.bitType#
- arkouda.check_np_dtype(dt: numpy.dtype) None#
Assert that numpy dtype dt is one of the dtypes supported by arkouda, otherwise raise TypeError.
- Raises:
TypeError – Raised if the dtype is not in supported dtypes or if dt is not a np.dtype
- arkouda.translate_np_dtype(dt: numpy.dtype) Tuple[str, int]#
Split numpy dtype dt into its kind and byte size, raising TypeError for unsupported dtypes.
- Raises:
TypeError – Raised if the dtype is not in supported dtypes or if dt is not a np.dtype
- arkouda.resolve_scalar_dtype(val: object) str#
Try to infer what dtype arkouda_server should treat val as.
- arkouda.ARKOUDA_SUPPORTED_DTYPES#
- arkouda.bool_scalars#
- arkouda.float_scalars#
- arkouda.int_scalars#
- arkouda.numeric_scalars#
- arkouda.numpy_scalars#
- arkouda.str_scalars#
- arkouda.all_scalars#
The DType enum defines the supported Arkouda data types in string form.
- arkouda.get_byteorder(dt: numpy.dtype) str#
Get a concrete byteorder (turns ‘=’ into ‘<’ or ‘>’)
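Resolving '=' to a concrete byte order can be sketched in pure Python using the client machine's native order (a hypothetical helper; the real function inspects a numpy dtype):

```python
import sys

def concrete_byteorder(symbol):
    # '=' means native order; resolve it using this machine's byte order.
    if symbol == "=":
        return "<" if sys.byteorder == "little" else ">"
    return symbol

print(concrete_byteorder("="))  # '<' on little-endian machines
```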
- arkouda.get_server_byteorder() str#
Get the server’s byteorder
- arkouda.isSupportedNumber(num)#
- class arkouda.pdarray(name: str, mydtype: numpy.dtype | str, size: arkouda.dtypes.int_scalars, ndim: arkouda.dtypes.int_scalars, shape: Sequence[int], itemsize: arkouda.dtypes.int_scalars, max_bits: int | None = None)#
The basic arkouda array class. This class contains only the attributes of the array; the data resides on the arkouda server. When a server operation results in a new array, arkouda will create a pdarray instance that points to the array data on the server. As such, the user should not initialize pdarray instances directly.
- name#
The server-side identifier for the array
- Type:
str
- dtype#
The element type of the array
- Type:
dtype
- size#
The number of elements in the array
- Type:
int_scalars
- ndim#
The rank of the array (currently only rank 1 arrays supported)
- Type:
int_scalars
- shape#
A list or tuple containing the sizes of each dimension of the array
- Type:
Sequence[int]
- itemsize#
The size in bytes of each element
- Type:
int_scalars
- property max_bits#
- BinOps#
- OpEqOps#
- objType = 'pdarray'#
- format_other(other) str#
Attempt to cast scalar other to the element dtype of this pdarray, and print the resulting value to a string (e.g. for sending to a server command). The user should not call this function directly.
- Parameters:
other (object) – The scalar to be cast to the pdarray.dtype
- Return type:
string representation of np.dtype corresponding to the other parameter
- Raises:
TypeError – Raised if the other parameter cannot be converted to Numpy dtype
- transfer(hostname: str, port: arkouda.dtypes.int_scalars)#
Sends a pdarray to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the pdarray is running.
port (int_scalars) – The port over which to send the array. This needs to be an open port (i.e., not one that the Arkouda server is running on). Sending opens numLocales ports in succession, using ports in the range {port..(port+numLocales-1)} (e.g., running an Arkouda server of 4 nodes with port 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- opeq(other, op)#
- fill(value: arkouda.dtypes.numeric_scalars) None#
Fill the array (in place) with a constant value.
- Parameters:
value (numeric_scalars) –
- Raises:
TypeError – Raised if value is not an int, int64, float, or float64
- any() numpy.bool_#
Return True iff any element of the array evaluates to True.
- all() numpy.bool_#
Return True iff all elements of the array evaluate to True.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry
- Parameters:
None –
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
Note
This will return True if the object is registered itself or as a component of another object
- info() str#
Returns a JSON formatted string containing information about all components of self
- Parameters:
None –
- Returns:
JSON string containing information about all components of self
- Return type:
str
- pretty_print_info() None#
Prints information about all components of self in a human readable format
- Parameters:
None –
- Return type:
None
- is_sorted() numpy.bool_#
Return True iff the array is monotonically non-decreasing.
- Parameters:
None –
- Returns:
Indicates if the array is monotonically non-decreasing
- Return type:
bool
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- sum() arkouda.dtypes.numeric_and_bool_scalars#
Return the sum of all elements in the array.
- prod() numpy.float64#
Return the product of all elements in the array. The return value is always a np.float64.
- min() arkouda.dtypes.numpy_scalars#
Return the minimum value of the array.
- max() arkouda.dtypes.numpy_scalars#
Return the maximum value of the array.
- argmin() numpy.int64 | numpy.uint64#
Return the index of the first occurrence of the array min value
- argmax() numpy.int64 | numpy.uint64#
Return the index of the first occurrence of the array max value.
- mean() numpy.float64#
Return the mean of the array.
- var(ddof: arkouda.dtypes.int_scalars = 0) numpy.float64#
Compute the variance. See arkouda.var for details.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
- Returns:
The scalar variance of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
ValueError – Raised if the ddof >= pdarray size
RuntimeError – Raised if there’s a server-side error thrown
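The effect of ddof can be shown with a pure-Python sketch of the same formula (illustrative only; the actual computation happens server-side): the sum of squared deviations from the mean is divided by (n - ddof).

```python
def var_ddof(xs, ddof=0):
    # Variance with "Delta Degrees of Freedom": divide the sum of squared
    # deviations from the mean by (n - ddof).
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - ddof)

print(var_ddof([1.0, 2.0, 3.0, 4.0]))          # 1.25 (population variance)
print(var_ddof([1.0, 2.0, 3.0, 4.0], ddof=1))  # ~1.6667 (sample variance)
```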
- std(ddof: arkouda.dtypes.int_scalars = 0) numpy.float64#
Compute the standard deviation. See arkouda.std for details.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
The scalar standard deviation of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- cov(y: pdarray) numpy.float64#
Compute the covariance between self and y.
- Parameters:
y (pdarray) – Other pdarray used to calculate covariance
- Returns:
The scalar covariance of the two arrays
- Return type:
np.float64
- Raises:
TypeError – Raised if y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- corr(y: pdarray) numpy.float64#
Compute the correlation between self and y using pearson correlation coefficient.
- Parameters:
y (pdarray) – Other pdarray used to calculate correlation
- Returns:
The scalar correlation of the two arrays
- Return type:
np.float64
- Raises:
TypeError – Raised if y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- mink(k: arkouda.dtypes.int_scalars) pdarray#
Compute the minimum “k” values.
- Parameters:
k (int_scalars) – The desired count of minimum values to be returned by the output.
- Returns:
The minimum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- maxk(k: arkouda.dtypes.int_scalars) pdarray#
Compute the maximum “k” values.
- Parameters:
k (int_scalars) – The desired count of maximum values to be returned by the output.
- Returns:
The maximum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- argmink(k: arkouda.dtypes.int_scalars) pdarray#
Finds the indices corresponding to the minimum “k” values.
- Parameters:
k (int_scalars) – The desired count of minimum values to be returned by the output.
- Returns:
Indices corresponding to the minimum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- argmaxk(k: arkouda.dtypes.int_scalars) pdarray#
Finds the indices corresponding to the maximum “k” values.
- Parameters:
k (int_scalars) – The desired count of maximum values to be returned by the output.
- Returns:
Indices corresponding to the maximum k values, sorted
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
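The semantics of the four "k" methods mirror sorting and slicing; a pure-Python sketch (illustrative, not the server-side selection algorithm, which avoids a full sort):

```python
def mink(xs, k):
    # The k smallest values, in ascending order.
    return sorted(xs)[:k]

def maxk(xs, k):
    # The k largest values, in ascending order.
    return sorted(xs)[-k:]

def argmaxk(xs, k):
    # Indices of the k largest values, sorted by value.
    return sorted(range(len(xs)), key=lambda i: xs[i])[-k:]

data = [5, 1, 4, 2, 3]
print(mink(data, 2))     # [1, 2]
print(maxk(data, 2))     # [4, 5]
print(argmaxk(data, 2))  # [2, 0]
```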
- value_counts()#
Count the occurrences of the unique values of self.
- Returns:
unique_values (pdarray) – The unique values, sorted in ascending order
counts (pdarray, int64) – The number of times the corresponding unique value occurs
Examples
>>> ak.array([2, 0, 2, 4, 0, 0]).value_counts() (array([0, 2, 4]), array([3, 2, 1]))
- astype(dtype) pdarray#
Cast values of pdarray to provided dtype
- Parameters:
dtype (np.dtype or str) – Dtype to cast to
- Returns:
An arkouda pdarray with values converted to the specified data type
- Return type:
ak.pdarray
Notes
This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.
- slice_bits(low, high) pdarray#
Returns a pdarray containing only bits from low to high of self.
This is zero indexed and inclusive on both ends, so slicing the bottom 64 bits is pda.slice_bits(0, 63)
- Parameters:
low (int) – The lowest bit included in the slice (inclusive) zero indexed, so the first bit is 0
high (int) – The highest bit included in the slice (inclusive)
- Returns:
A new pdarray containing the bits of self from low to high
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> p = ak.array([2**65 + (2**64 - 1)]) >>> bin(p[0]) '0b101111111111111111111111111111111111111111111111111111111111111111'
>>> bin(p.slice_bits(64, 65)[0]) '0b10'
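The per-element arithmetic is a shift followed by a mask; a pure-Python sketch of what slice_bits computes for a single value (the real method operates on whole arrays server-side):

```python
def slice_bits(x, low, high):
    # Extract bits low..high, inclusive and zero-indexed from the least
    # significant bit: shift right by `low`, then mask off the slice width.
    mask = (1 << (high - low + 1)) - 1
    return (x >> low) & mask

p = 2**65 + (2**64 - 1)
print(bin(slice_bits(p, 64, 65)))  # 0b10
```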
- bigint_to_uint_arrays() List[pdarray]#
Creates a list of uint pdarrays from a bigint pdarray. The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.
- Returns:
A list of uint pdarrays where: The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.
- Return type:
List[pdarrays]
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> a = ak.arange(2**64, 2**64 + 5) >>> a array(["18446744073709551616" "18446744073709551617" "18446744073709551618" "18446744073709551619" "18446744073709551620"])
>>> a.bigint_to_uint_arrays() [array([1 1 1 1 1]), array([0 1 2 3 4])]
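The limb decomposition can be sketched element-wise in pure Python (illustrative; the actual method returns uint64 pdarrays, one per limb position):

```python
def bigint_to_uint64_limbs(x):
    # Split a non-negative int into 64-bit limbs, most significant first.
    limbs = []
    while True:
        limbs.append(x & (2**64 - 1))
        x >>= 64
        if x == 0:
            break
    return limbs[::-1]

print(bigint_to_uint64_limbs(2**64 + 3))  # [1, 3]
```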
- reshape(*shape, order='row_major')#
Gives a new shape to an array without changing its data.
- Parameters:
shape (int, tuple of ints, or pdarray) – The new shape should be compatible with the original shape.
order (str {'row_major' | 'C' | 'column_major' | 'F'}) – Read the elements of the pdarray in this index order. By default, read the elements in row_major or C-like order, where the last index changes the fastest. If ‘column_major’ or ‘F’, read the elements in column_major or Fortran-like order, where the first index changes the fastest.
- Returns:
An arrayview object with the data from the array but with the new shape
- Return type:
- to_ndarray() numpy.ndarray#
Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same attributes and data as the pdarray
- Return type:
np.ndarray
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed
client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
Examples
>>> a = ak.arange(0, 5, 1) >>> a.to_ndarray() array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray()) numpy.ndarray
- to_list() List#
Convert the array to a list, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A list with the same data as the pdarray
- Return type:
list
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed
client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.arange(0, 5, 1) >>> a.to_list() [0, 1, 2, 3, 4]
>>> type(a.to_list()) list
- to_cuda()#
Convert the array to a Numba DeviceND array, transferring array data from the arkouda server to Python via ndarray. If the array exceeds a builtin size limit, a RuntimeError is raised.
- Returns:
A Numba ndarray with the same attributes and data as the pdarray; on GPU
- Return type:
numba.DeviceNDArray
- Raises:
ImportError – Raised if CUDA is not available
ModuleNotFoundError – Raised if Numba is either not installed or not enabled
RuntimeError – Raised if there is a server-side error thrown in the course of retrieving the pdarray.
Notes
The number of bytes in the array cannot exceed
client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.arange(0, 5, 1) >>> a.to_cuda() array([0, 1, 2, 3, 4])
>>> type(a.to_cuda()) numpy.devicendarray
- to_parquet(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None) str#
Save the pdarray to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission. - Output files have names of the form
<prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25) >>> # Saving without an extension >>> a.to_parquet('path/prefix', dataset='array') Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####`` >>> # Saving with an extension (Parquet) >>> a.to_parquet('path/prefix.parquet', dataset='array') Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
- to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute') str#
Save the pdarray to HDF5. The object can be saved to a collection of files or a single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”. When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission. - Output files have names of the form
<prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25) >>> # Saving without an extension >>> a.to_hdf('path/prefix', dataset='array') Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####`` >>> # Saving with an extension (HDF5) >>> a.to_hdf('path/prefix.h5', dataset='array') Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number >>> # Saving to a single file >>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single') Saves the array to a single HDF5 file on the root node. ``cwd/path/name_prefix.hdf5``
- update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)#
Overwrite the dataset with the name provided with this pdarray. If the dataset does not exist, it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
- to_csv(prefix_path: str, dataset: str = 'array', col_delim: str = ',', overwrite: bool = False)#
Write pdarray to CSV file(s). File will contain a single column with the pdarray data. All CSV files written by Arkouda include a header denoting data types of the columns.
- Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
dataset (str) – Column name to save the pdarray under. Defaults to “array”.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
str response message
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations. The column delimiter is expected to be the same for column names and data. Be sure that column delimiters are not found within your data. All CSV files must delimit rows using newline (\n) at this time.
- save(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str#
DEPRECATED Save the pdarray to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 supports single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.
file_type (str ("single" | "distribute")) – Default: “distribute”. When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append
TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string
See also
save_all, load, read, to_parquet, to_hdf
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission. Output files have names of the form
<prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. Previously all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found. Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25) >>> # Saving without an extension >>> a.save('path/prefix', dataset='array') Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####`` >>> # Saving with an extension (HDF5) >>> a.save('path/prefix.h5', dataset='array') Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number >>> # Saving with an extension (Parquet) >>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet') Saves the array in numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
- register(user_defined_name: str) pdarray#
Register this pdarray with a user defined name in the arkouda server so it can be attached to later using pdarray.attach(). This is an in-place operation; registering a pdarray more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one pdarray at a time.
- Parameters:
user_defined_name (str) – user defined name array is to be registered under
- Returns:
The same pdarray, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluid programming style. Please note that you cannot register two different pdarrays with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the pdarray with the user_defined_name If the user is attempting to register more than one pdarray with the same name, the former should be unregistered first to free up the registration name.
See also
attach, unregister, is_registered, list_registry, unregister_pdarray_by_name
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100) >>> a.register("my_zeros") >>> # potentially disconnect from server and reconnect to server >>> b = ak.pdarray.attach("my_zeros") >>> # ...other work... >>> b.unregister()
- unregister() None#
Unregister a pdarray in the arkouda server which was previously registered using register() and/or attached to using attach().
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not find the internal name/symbol to remove
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100) >>> a.register("my_zeros") >>> # potentially disconnect from server and reconnect to server >>> b = ak.pdarray.attach("my_zeros") >>> # ...other work... >>> b.unregister()
- static attach(user_defined_name: str) pdarray#
Class method to return a pdarray attached to the registered name in the arkouda server, which was registered using register().
- Parameters:
user_defined_name (str) – user defined name which array was registered under
- Returns:
pdarray which is bound to the corresponding server side component which was registered with user_defined_name
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100) >>> a.register("my_zeros") >>> # potentially disconnect from server and reconnect to server >>> b = ak.pdarray.attach("my_zeros") >>> # ...other work... >>> b.unregister()
- arkouda.clear() None#
Send a clear message to clear all unregistered data from the server symbol table
- Return type:
None
- Raises:
RuntimeError – Raised if there is a server-side error in executing clear request
- arkouda.any(pda: pdarray) numpy.bool_#
Return True iff any element of the array evaluates to True.
- Parameters:
pda (pdarray) – The pdarray instance to be evaluated
- Returns:
Indicates if 1..n pdarray elements evaluate to True
- Return type:
bool
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- arkouda.all(pda: pdarray) numpy.bool_#
Return True iff all elements of the array evaluate to True.
- Parameters:
pda (pdarray) – The pdarray instance to be evaluated
- Returns:
Indicates if all pdarray elements evaluate to True
- Return type:
bool
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- arkouda.is_sorted(pda: pdarray) numpy.bool_#
Return True iff the array is monotonically non-decreasing.
- Parameters:
pda (pdarray) – The pdarray instance to be evaluated
- Returns:
Indicates if the array is monotonically non-decreasing
- Return type:
bool
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- arkouda.sum(pda: pdarray) numpy.float64#
Return the sum of all elements in the array.
- Parameters:
pda (pdarray) – Values for which to calculate the sum
- Returns:
The sum of all elements in the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- arkouda.prod(pda: pdarray) numpy.float64#
Return the product of all elements in the array. Return value is always a np.float64 or np.int64
- Parameters:
pda (pdarray) – Values for which to calculate the product
- Returns:
The product calculated from the pda
- Return type:
numpy_scalars
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- arkouda.min(pda: pdarray) arkouda.dtypes.numpy_scalars#
Return the minimum value of the array.
- Parameters:
pda (pdarray) – Values for which to calculate the min
- Returns:
The min calculated from the pda
- Return type:
numpy_scalars
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- arkouda.max(pda: pdarray) arkouda.dtypes.numpy_scalars#
Return the maximum value of the array.
- Parameters:
pda (pdarray) – Values for which to calculate the max
- Returns:
The max calculated from the pda
- Return type:
numpy_scalars
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- arkouda.argmin(pda: pdarray) numpy.int64 | numpy.uint64#
Return the index of the first occurrence of the array min value.
- Parameters:
pda (pdarray) – Values for which to calculate the argmin
- Returns:
The index of the argmin calculated from the pda
- Return type:
Union[np.int64, np.uint64]
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- arkouda.argmax(pda: pdarray) numpy.int64 | numpy.uint64#
Return the index of the first occurrence of the array max value.
- Parameters:
pda (pdarray) – Values for which to calculate the argmax
- Returns:
The index of the argmax calculated from the pda
- Return type:
Union[np.int64, np.uint64]
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- arkouda.mean(pda: pdarray) numpy.float64#
Return the mean of the array.
- Parameters:
pda (pdarray) – Values for which to calculate the mean
- Returns:
The mean calculated from the pda sum and size
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- arkouda.var(pda: pdarray, ddof: arkouda.dtypes.int_scalars = 0) numpy.float64#
Return the variance of values in the array.
- Parameters:
pda (pdarray) – Values for which to calculate the variance
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
- Returns:
The scalar variance of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
ValueError – Raised if the ddof >= pdarray size
RuntimeError – Raised if there’s a server-side error thrown
Notes
The variance is the average of the squared deviations from the mean, i.e.,
var = mean((x - x.mean())**2). The mean is normally calculated as
x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
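The ddof convention described above matches NumPy's, so it can be cross-checked directly (a minimal sketch; `x` is an arbitrary example array, not from the arkouda docs):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
n = x.size

# ddof=0: divisor N, the maximum likelihood estimate
var0 = ((x - x.mean()) ** 2).sum() / n
# ddof=1: divisor N - 1, the unbiased estimate
var1 = ((x - x.mean()) ** 2).sum() / (n - 1)
```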
- arkouda.std(pda: pdarray, ddof: arkouda.dtypes.int_scalars = 0) numpy.float64#
Return the standard deviation of values in the array. The standard deviation is implemented as the square root of the variance.
- Parameters:
pda (pdarray) – values for which to calculate the standard deviation
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
The scalar standard deviation of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance or ddof is not an integer
ValueError – Raised if ddof is an integer < 0
RuntimeError – Raised if there’s a server-side error thrown
Notes
The standard deviation is the square root of the average of the squared deviations from the mean, i.e.,
std = sqrt(mean((x - x.mean())**2)). The average squared deviation is normally calculated as
x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.
- arkouda.mink(pda: pdarray, k: arkouda.dtypes.int_scalars) pdarray#
Find the k minimum values of an array.
Returns the smallest k values of an array, sorted
- Parameters:
pda (pdarray) – Input array.
k (int_scalars) – The desired count of minimum values to be returned by the output.
- Returns:
The minimum k values from pda, sorted
- Return type:
- Raises:
TypeError – Raised if pda is not a pdarray
ValueError – Raised if the pda is empty or k < 1
Notes
This call is equivalent in value to:
a[ak.argsort(a)[:k]]
and generally outperforms this operation.
This reduction will see a significant drop in performance as k grows beyond a certain value. This value is system dependent, but generally a k of about 5 million is where performance degradation has been observed.
Examples
>>> A = ak.array([10,5,1,3,7,2,9,0]) >>> ak.mink(A, 3) array([0, 1, 2]) >>> ak.mink(A, 4) array([0, 1, 2, 3])
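The argsort equivalence noted above can be illustrated with NumPy, using the same example array as the docstring (a sketch, not arkouda code):

```python
import numpy as np

a = np.array([10, 5, 1, 3, 7, 2, 9, 0])
k = 3
# the k smallest values, sorted: the same result ak.mink documents
smallest = a[np.argsort(a)[:k]]
```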
- arkouda.maxk(pda: pdarray, k: arkouda.dtypes.int_scalars) pdarray#
Find the k maximum values of an array.
Returns the largest k values of an array, sorted
- Parameters:
pda (pdarray) – Input array.
k (int_scalars) – The desired count of maximum values to be returned by the output.
- Returns:
The maximum k values from pda, sorted
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray or k is not an integer
ValueError – Raised if the pda is empty or k < 1
Notes
This call is equivalent in value to:
a[ak.argsort(a)[-k:]]
and generally outperforms this operation.
This reduction will see a significant drop in performance as k grows beyond a certain value. This value is system dependent, but generally a k of about 5 million is where performance degradation has been observed.
Examples
>>> A = ak.array([10,5,1,3,7,2,9,0]) >>> ak.maxk(A, 3) array([7, 9, 10]) >>> ak.maxk(A, 4) array([5, 7, 9, 10])
- arkouda.argmink(pda: pdarray, k: arkouda.dtypes.int_scalars) pdarray#
Finds the indices corresponding to the k minimum values of an array.
- Parameters:
pda (pdarray) – Input array.
k (int_scalars) – The desired count of indices corresponding to minimum array values
- Returns:
The indices of the minimum k values from the pda, sorted
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray or k is not an integer
ValueError – Raised if the pda is empty or k < 1
Notes
This call is equivalent in value to:
ak.argsort(a)[:k]
and generally outperforms this operation.
This reduction will see a significant drop in performance as k grows beyond a certain value. This value is system dependent, but generally about a k of 5 million is where performance degradation has been observed.
Examples
>>> A = ak.array([10,5,1,3,7,2,9,0]) >>> ak.argmink(A, 3) array([7, 2, 5]) >>> ak.argmink(A, 4) array([7, 2, 5, 3])
- arkouda.argmaxk(pda: pdarray, k: arkouda.dtypes.int_scalars) pdarray#
Find the indices corresponding to the k maximum values of an array.
Returns the largest k values of an array, sorted
- Parameters:
pda (pdarray) – Input array.
k (int_scalars) – The desired count of indices corresponding to maximum array values
- Returns:
The indices of the maximum k values from the pda, sorted
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray or k is not an integer
ValueError – Raised if the pda is empty or k < 1
Notes
This call is equivalent in value to:
ak.argsort(a)[-k:]
and generally outperforms this operation.
This reduction will see a significant drop in performance as k grows beyond a certain value. This value is system dependent, but generally about a k of 5 million is where performance degradation has been observed.
Examples
>>> A = ak.array([10,5,1,3,7,2,9,0]) >>> ak.argmaxk(A, 3) array([4, 6, 0]) >>> ak.argmaxk(A, 4) array([1, 4, 6, 0])
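The argsort equivalence can be checked with NumPy; the -k: slice selects the indices of the k largest values, in ascending order of value (a sketch using the docstring's example array):

```python
import numpy as np

a = np.array([10, 5, 1, 3, 7, 2, 9, 0])
k = 3
# indices of the 3 largest values, ascending by value
idx = np.argsort(a)[-k:]
```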
- arkouda.popcount(pda: pdarray) pdarray#
Find the population (number of bits set) for each integer in an array.
- Parameters:
pda (pdarray, int64, uint64, bigint) – Input array (must be integral).
- Returns:
population – The number of bits set (1) in each element
- Return type:
- Raises:
TypeError – If input array is not int64, uint64, or bigint
Examples
>>> A = ak.arange(10) >>> ak.popcount(A) array([0, 1, 1, 2, 1, 2, 2, 3, 1, 2])
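For scalar integers, the same quantity can be computed in pure Python, which is a handy way to spot-check results (popcount below is a hypothetical helper, not part of arkouda):

```python
def popcount(x: int) -> int:
    # count the set bits in the binary representation
    return bin(x).count("1")

vals = [popcount(i) for i in range(10)]
```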
- arkouda.parity(pda: pdarray) pdarray#
Find the bit parity (XOR of all bits) for each integer in an array.
- Parameters:
pda (pdarray, int64, uint64, bigint) – Input array (must be integral).
- Returns:
parity – The parity of each element: 0 if even number of bits set, 1 if odd.
- Return type:
- Raises:
TypeError – If input array is not int64, uint64, or bigint
Examples
>>> A = ak.arange(10) >>> ak.parity(A) array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
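Parity is the population count modulo 2, which gives a simple pure-Python cross-check (parity below is a hypothetical helper, not part of arkouda):

```python
def parity(x: int) -> int:
    # 0 if an even number of bits are set, 1 if odd
    return bin(x).count("1") % 2

vals = [parity(i) for i in range(10)]
```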
- arkouda.clz(pda: pdarray) pdarray#
Count leading zeros for each integer in an array.
- Parameters:
pda (pdarray, int64, uint64, bigint) – Input array (must be integral).
- Returns:
lz – The number of leading zeros of each element.
- Return type:
- Raises:
TypeError – If input array is not int64, uint64, or bigint
Examples
>>> A = ak.arange(10) >>> ak.clz(A) array([64, 63, 62, 62, 61, 61, 61, 61, 60, 60])
- arkouda.ctz(pda: pdarray) pdarray#
Count trailing zeros for each integer in an array.
- Parameters:
pda (pdarray, int64, uint64, bigint) – Input array (must be integral).
- Returns:
tz – The number of trailing zeros of each element.
- Return type:
Notes
ctz(0) is defined to be zero.
- Raises:
TypeError – If input array is not int64, uint64, or bigint
Examples
>>> A = ak.arange(10) >>> ak.ctz(A) array([0, 0, 1, 0, 2, 0, 1, 0, 3, 0])
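A pure-Python cross-check of the trailing-zero count, using the x & -x trick to isolate the lowest set bit (ctz below is a hypothetical helper; ctz(0) is pinned to 0 to match the note above):

```python
def ctz(x: int) -> int:
    # ctz(0) is defined to be zero, matching the docs
    if x == 0:
        return 0
    # x & -x isolates the lowest set bit; its bit_length - 1
    # is the number of trailing zeros
    return (x & -x).bit_length() - 1

vals = [ctz(i) for i in range(10)]
```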
- arkouda.rotl(x, rot) pdarray#
Rotate bits of <x> to the left by <rot>.
- Parameters:
- Returns:
rotated – The rotated elements of x.
- Return type:
pdarray(int64/uint64)
- Raises:
TypeError – If input array is not int64 or uint64
Examples
>>> A = ak.arange(10) >>> ak.rotl(A, A) array([0, 2, 8, 24, 64, 160, 384, 896, 2048, 4608])
- arkouda.rotr(x, rot) pdarray#
Rotate bits of <x> to the right by <rot>.
- Parameters:
- Returns:
rotated – The rotated elements of x.
- Return type:
pdarray(int64/uint64)
- Raises:
TypeError – If input array is not int64 or uint64
Examples
>>> A = ak.arange(10) >>> ak.rotr(1024 * A, A) array([0, 512, 512, 384, 256, 160, 96, 56, 32, 18])
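Both rotations can be cross-checked in pure Python on 64-bit values (rotl64 and rotr64 below are hypothetical helpers mimicking the documented semantics, not arkouda functions):

```python
MASK64 = (1 << 64) - 1

def rotl64(x: int, rot: int) -> int:
    # rotate a 64-bit value left by rot bits
    rot %= 64
    return ((x << rot) | (x >> (64 - rot))) & MASK64

def rotr64(x: int, rot: int) -> int:
    # rotate a 64-bit value right by rot bits
    rot %= 64
    return ((x >> rot) | (x << (64 - rot))) & MASK64
```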
- arkouda.cov(x: pdarray, y: pdarray) numpy.float64#
Return the covariance of x and y
- Parameters:
- Returns:
The scalar covariance of the two pdarrays
- Return type:
np.float64
- Raises:
TypeError – Raised if x or y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
Notes
The covariance is calculated by
cov = ((x - x.mean()) * (y - y.mean())).sum() / (x.size - 1).
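The formula above uses divisor (N - 1), the same convention as NumPy's default covariance, so it can be verified directly (a sketch with arbitrary example data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0])
# sample covariance with divisor (N - 1), as in the formula above
cov = ((x - x.mean()) * (y - y.mean())).sum() / (x.size - 1)
```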
- arkouda.corr(x: pdarray, y: pdarray) numpy.float64#
Return the correlation between x and y
- Parameters:
- Returns:
The scalar correlation of the two pdarrays
- Return type:
np.float64
- Raises:
TypeError – Raised if x or y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
Notes
The correlation is calculated by cov(x, y) / (x.std(ddof=1) * y.std(ddof=1))
- arkouda.divmod(x: arkouda.dtypes.numeric_scalars | pdarray, y: arkouda.dtypes.numeric_scalars | pdarray, where: bool | pdarray = True) Tuple[pdarray, pdarray]#
- Parameters:
x (numeric_scalars(float_scalars, int_scalars) or pdarray) – The dividend array, the values that will be the numerator of the floor division and will be acted on by the bases for modular division.
y (numeric_scalars(float_scalars, int_scalars) or pdarray) – The divisor array, the values that will be the denominator of the division and will be the bases for the modular division.
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the corresponding value will be divided using floor and modular division. Elsewhere, it will retain its original value. Default set to True.
- Returns:
Returns a tuple that contains quotient and remainder of the division
- Return type:
- Raises:
TypeError – At least one entry must be a pdarray
ValueError – If both inputs are both pdarrays, their size must match
ZeroDivisionError – No entry in y is allowed to be 0, to prevent division by zero
Notes
The div is calculated by x // y. The mod is calculated by x % y.
Examples
>>> x = ak.arange(5, 10) >>> y = ak.array([2, 1, 4, 5, 8]) >>> ak.divmod(x,y) (array([2 6 1 1 1]), array([1 0 3 3 1])) >>> ak.divmod(x,y, x % 2 == 0) (array([5 6 7 1 9]), array([5 0 7 3 9]))
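The where semantics (positions where the condition is False keep their original values) can be reproduced with NumPy, matching the second docstring example above:

```python
import numpy as np

x = np.arange(5, 10)
y = np.array([2, 1, 4, 5, 8])
where = x % 2 == 0

# positions where the condition is False keep the original x values
quot = np.where(where, x // y, x)
rem = np.where(where, x % y, x)
```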
- arkouda.sqrt(pda: pdarray, where: bool | pdarray = True) pdarray#
Takes the square root of array. If where is given, the operation will only take place in the positions where the where condition is True.
- Parameters:
- Returns:
pdarray
Returns a pdarray of square rooted values, under the boolean where condition.
Examples
>>> a = ak.arange(5) >>> ak.sqrt(a) array([0 1 1.4142135623730951 1.7320508075688772 2]) >>> ak.sqrt(a, ak.array([True, True, False, False, True])) array([0, 1, 2, 3, 2])
- arkouda.power(pda: pdarray, pwr: int | float | pdarray, where: bool | pdarray = True) pdarray#
Raises an array to a power. If where is given, the operation will only take place in the positions where the where condition is True.
Note: Our implementation of the where argument deviates from numpy. The difference in behavior occurs at positions where the where argument contains a False. In numpy, these position will have uninitialized memory (which can contain anything and will vary between runs). We have chosen to instead return the value of the original array in these positions.
- Parameters:
pda (pdarray) – A pdarray of values that will be raised to a power (pwr)
pwr (integer, float, or pdarray) – The power(s) that pda is raised to
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the corresponding value will be raised to the respective power. Elsewhere, it will retain its original value. Default set to True.
- Returns:
pdarray
Returns a pdarray of values raised to a power, under the boolean where condition.
Examples
>>> a = ak.arange(5) >>> ak.power(a, 3) array([0, 1, 8, 27, 64]) >>> ak.power(a, 3, a % 2 == 0) array([0, 1, 8, 3, 64])
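The documented deviation from NumPy (False positions retain the original value instead of holding uninitialized memory) can be emulated with np.where:

```python
import numpy as np

a = np.arange(5)
where = a % 2 == 0
# even positions are cubed; odd positions keep their original values
out = np.where(where, a ** 3, a)
```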
- arkouda.mod(dividend, divisor) pdarray#
Returns the element-wise remainder of division.
Computes the remainder complementary to the floor_divide function. It is equivalent to np.mod; the remainder has the same sign as the divisor.
- Parameters:
dividend – The array being acted on by the bases for the modular division.
divisor – The array that will be the bases for the modular division.
- Returns:
Returns an array that contains the element-wise remainder of division.
- Return type:
- arkouda.fmod(dividend: pdarray | arkouda.dtypes.numeric_scalars, divisor: pdarray | arkouda.dtypes.numeric_scalars) pdarray#
Returns the element-wise remainder of division.
It is equivalent to np.fmod; the remainder has the same sign as the dividend.
- Parameters:
- Returns:
Returns an array that contains the element-wise remainder of division.
- Return type:
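The sign difference between mod and fmod is easiest to see on mixed-sign inputs; NumPy's functions follow the same convention as described above:

```python
import numpy as np

a = np.array([-7.0, 7.0])
b = np.array([3.0, -3.0])
fm = np.fmod(a, b)  # sign follows the dividend
md = np.mod(a, b)   # sign follows the divisor
```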
- arkouda.attach_pdarray(user_defined_name: str) pdarray#
Class method to return a pdarray attached to the registered name in the arkouda server, which was registered using register().
- Parameters:
user_defined_name (str) – user defined name which array was registered under
- Returns:
pdarray which is bound to the corresponding server side component which was registered with user_defined_name
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
See also
attach, register, unregister, is_registered, unregister_pdarray_by_name, list_registry
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100) >>> a.register("my_zeros") >>> # potentially disconnect from server and reconnect to server >>> b = ak.attach_pdarray("my_zeros") >>> # ...other work... >>> b.unregister()
- arkouda.unregister_pdarray_by_name(user_defined_name: str) None#
Unregister a named pdarray in the arkouda server which was previously registered using register() and/or attached to using attach_pdarray().
- Parameters:
user_defined_name (str) – user defined name which array was registered under
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not find the internal name/symbol to remove
See also
register, unregister, is_registered, list_registry, attach
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100) >>> a.register("my_zeros") >>> # potentially disconnect from server and reconnect to server >>> b = ak.attach_pdarray("my_zeros") >>> # ...other work... >>> ak.unregister_pdarray_by_name("my_zeros")
- exception arkouda.RegistrationError#
Bases:
Exception
Error/Exception used when the Arkouda Server cannot register an object
- arkouda.in1d(pda1: arkouda.groupbyclass.groupable, pda2: arkouda.groupbyclass.groupable, assume_unique: bool = False, symmetric: bool = False, invert: bool = False) arkouda.pdarrayclass.pdarray | arkouda.groupbyclass.groupable#
Test whether each element of a 1-D array is also present in a second array.
Returns a boolean array the same length as pda1 that is True where an element of pda1 is in pda2 and False otherwise.
Supports multi-level inputs – tests membership of rows of a in the set of rows of b.
- Parameters:
a (list of pdarrays, pdarray, Strings, or Categorical) – Rows are elements for which to test membership in b
b (list of pdarrays, pdarray, Strings, or Categorical) – Rows are elements of the set in which to test membership
assume_unique (bool) – If true, assume rows of a and b are each unique and sorted. By default, sort and unique them explicitly.
symmetric (bool) – If True, return both in1d(pda1, pda2) and in1d(pda2, pda1) when pda1 and pda2 are single items.
invert (bool, optional) – If True, the values in the returned array are inverted (that is, False where an element of pda1 is in pda2 and True otherwise). Default is False.
ak.in1d(a, b, invert=True) is equivalent to (but faster than) ~ak.in1d(a, b).
- Returns:
True for each row in a that is contained in b
- Return type:
pdarray, bool
Notes
Only works for pdarrays of int64 dtype, Strings, or Categorical
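The membership-test semantics mirror NumPy's in1d, including the invert flag noted above (a sketch with arbitrary example data):

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([2, 4, 6])
mask = np.in1d(a, b)              # True where a's element appears in b
inv = np.in1d(a, b, invert=True)  # equivalent to ~mask, but faster
```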
- arkouda.concatenate(arrays: Sequence[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical], ordered: bool = True) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical#
Concatenate a list or tuple of
pdarray or Strings objects into one pdarray or Strings object, respectively.
- Parameters:
arrays (Sequence[Union[pdarray,Strings,Categorical]]) – The arrays to concatenate. Must all have same dtype.
ordered (bool) – If True (default), the arrays will be appended in the order given. If False, array data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.
- Returns:
Single pdarray or Strings object containing all values, returned in the original order
- Return type:
Union[pdarray,Strings,Categorical]
- Raises:
ValueError – Raised if arrays is empty or if 1..n pdarrays have differing dtypes
TypeError – Raised if arrays is not a Python Sequence (such as a list or tuple) of pdarrays or Strings
RuntimeError – Raised if 1..n array elements are dtypes for which concatenate has not been implemented.
Examples
>>> ak.concatenate([ak.array([1, 2, 3]), ak.array([4, 5, 6])]) array([1, 2, 3, 4, 5, 6])
>>> ak.concatenate([ak.array([True,False,True]),ak.array([False,True,True])]) array([True, False, True, False, True, True])
>>> ak.concatenate([ak.array(['one','two']),ak.array(['three','four','five'])]) array(['one', 'two', 'three', 'four', 'five'])
- arkouda.union1d(pda1: arkouda.groupbyclass.groupable, pda2: arkouda.groupbyclass.groupable) arkouda.pdarrayclass.pdarray | arkouda.groupbyclass.groupable#
Find the union of two arrays/List of Arrays.
Return the unique, sorted array of values that are in either of the two input arrays.
- Parameters:
pda1 (pdarray/Sequence[pdarray, Strings, Categorical]) – Input array/Sequence of groupable objects
pda2 (pdarray/List) – Input array/sequence of groupable objects
- Returns:
Unique, sorted union of the input arrays.
- Return type:
pdarray/groupable
- Raises:
TypeError – Raised if either pda1 or pda2 is not a pdarray
RuntimeError – Raised if the dtype of either array is not supported
See also
Notes
ak.union1d is not supported for bool or float64 pdarrays
Examples
>>> # 1D Example >>> ak.union1d(ak.array([-1, 0, 1]), ak.array([-2, 0, 2])) array([-2, -1, 0, 1, 2]) #Multi-Array Example >>> a = ak.arange(1, 6) >>> b = ak.array([1, 5, 3, 4, 2]) >>> c = ak.array([1, 4, 3, 2, 5]) >>> d = ak.array([1, 2, 3, 5, 4]) >>> multia = [a, a, a] >>> multib = [b, c, d] >>> ak.union1d(multia, multib) [array[1, 2, 2, 3, 4, 4, 5, 5], array[1, 2, 5, 3, 2, 4, 4, 5], array[1, 2, 4, 3, 5, 4, 2, 5]]
- arkouda.intersect1d(pda1: arkouda.groupbyclass.groupable, pda2: arkouda.groupbyclass.groupable, assume_unique: bool = False) arkouda.pdarrayclass.pdarray | arkouda.groupbyclass.groupable#
Find the intersection of two arrays.
Return the sorted, unique values that are in both of the input arrays.
- Parameters:
pda1 (pdarray/Sequence[pdarray, Strings, Categorical]) – Input array/Sequence of groupable objects
pda2 (pdarray/List) – Input array/sequence of groupable objects
assume_unique (bool) – If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False.
- Returns:
Sorted 1D array/List of sorted pdarrays of common and unique elements.
- Return type:
pdarray/groupable
- Raises:
TypeError – Raised if either pda1 or pda2 is not a pdarray
RuntimeError – Raised if the dtype of either pdarray is not supported
See also
Notes
ak.intersect1d is not supported for bool or float64 pdarrays
Examples
>>> # 1D Example >>> ak.intersect1d([1, 3, 4, 3], [3, 1, 2, 1]) array([1, 3]) # Multi-Array Example >>> a = ak.arange(5) >>> b = ak.array([1, 5, 3, 4, 2]) >>> c = ak.array([1, 4, 3, 2, 5]) >>> d = ak.array([1, 2, 3, 5, 4]) >>> multia = [a, a, a] >>> multib = [b, c, d] >>> ak.intersect1d(multia, multib) [array([1, 3]), array([1, 3]), array([1, 3])]
- arkouda.setdiff1d(pda1: arkouda.groupbyclass.groupable, pda2: arkouda.groupbyclass.groupable, assume_unique: bool = False) arkouda.pdarrayclass.pdarray | arkouda.groupbyclass.groupable#
Find the set difference of two arrays.
Return the sorted, unique values in pda1 that are not in pda2.
- Parameters:
pda1 (pdarray/Sequence[pdarray, Strings, Categorical]) – Input array/Sequence of groupable objects
pda2 (pdarray/List) – Input array/sequence of groupable objects
assume_unique (bool) – If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False.
- Returns:
Sorted 1D array/List of sorted pdarrays of values in pda1 that are not in pda2.
- Return type:
pdarray/groupable
- Raises:
TypeError – Raised if either pda1 or pda2 is not a pdarray
RuntimeError – Raised if the dtype of either pdarray is not supported
See also
Notes
ak.setdiff1d is not supported for bool or float64 pdarrays
Examples
>>> a = ak.array([1, 2, 3, 2, 4, 1]) >>> b = ak.array([3, 4, 5, 6]) >>> ak.setdiff1d(a, b) array([1, 2]) #Multi-Array Example >>> a = ak.arange(1, 6) >>> b = ak.array([1, 5, 3, 4, 2]) >>> c = ak.array([1, 4, 3, 2, 5]) >>> d = ak.array([1, 2, 3, 5, 4]) >>> multia = [a, a, a] >>> multib = [b, c, d] >>> ak.setdiff1d(multia, multib) [array([2, 4, 5]), array([2, 4, 5]), array([2, 4, 5])]
- arkouda.setxor1d(pda1: arkouda.groupbyclass.groupable, pda2: arkouda.groupbyclass.groupable, assume_unique: bool = False) arkouda.pdarrayclass.pdarray | arkouda.groupbyclass.groupable#
Find the set exclusive-or (symmetric difference) of two arrays.
Return the sorted, unique values that are in only one (not both) of the input arrays.
- Parameters:
pda1 (pdarray/Sequence[pdarray, Strings, Categorical]) – Input array/Sequence of groupable objects
pda2 (pdarray/List) – Input array/sequence of groupable objects
assume_unique (bool) – If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False.
- Returns:
Sorted 1D array/List of sorted pdarrays of unique values that are in only one of the input arrays.
- Return type:
pdarray/groupable
- Raises:
TypeError – Raised if either pda1 or pda2 is not a pdarray
RuntimeError – Raised if the dtype of either pdarray is not supported
Notes
ak.setxor1d is not supported for bool or float64 pdarrays
Examples
>>> a = ak.array([1, 2, 3, 2, 4]) >>> b = ak.array([2, 3, 5, 7, 5]) >>> ak.setxor1d(a,b) array([1, 4, 5, 7]) #Multi-Array Example >>> a = ak.arange(1, 6) >>> b = ak.array([1, 5, 3, 4, 2]) >>> c = ak.array([1, 4, 3, 2, 5]) >>> d = ak.array([1, 2, 3, 5, 4]) >>> multia = [a, a, a] >>> multib = [b, c, d] >>> ak.setxor1d(multia, multib) [array([2, 2, 4, 4, 5, 5]), array([2, 5, 2, 4, 4, 5]), array([2, 4, 5, 4, 2, 5])]
- arkouda.array(a: arkouda.pdarrayclass.pdarray | numpy.ndarray | Iterable, dtype: numpy.dtype | type | str = None, max_bits: int = -1) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings#
Convert a Python or Numpy Iterable to a pdarray or Strings object, sending the corresponding data to the arkouda server.
- Parameters:
a (Union[pdarray, np.ndarray]) – Rank-1 array of a supported dtype
dtype (np.dtype, type, or str) – The target dtype to cast values to
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
A pdarray instance stored on arkouda server or Strings instance, which is composed of two pdarrays stored on arkouda server
- Return type:
- Raises:
TypeError – Raised if a is not a pdarray, np.ndarray, or Python Iterable such as a list, array, tuple, or deque
RuntimeError – Raised if a is not one-dimensional, nbytes > maxTransferBytes, a.dtype is not supported (not in DTypes), or if the product of a.size and a.itemsize > maxTransferBytes
ValueError – Raised if the returned message is malformed or does not contain the fields required to generate the array.
See also
Notes
The number of bytes in the input array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overwhelming the connection between the Python client and the arkouda server, under the assumption that it is a low-bandwidth connection. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but should proceed with caution.
If the pdarray or ndarray is of type U, this method is called twice recursively to create the Strings object and the two corresponding pdarrays for string bytes and offsets, respectively.
Examples
>>> ak.array(np.arange(1,10))
array([1, 2, 3, 4, 5, 6, 7, 8, 9])

>>> ak.array(range(1,10))
array([1, 2, 3, 4, 5, 6, 7, 8, 9])

>>> strings = ak.array([f'string {i}' for i in range(0,5)])
>>> type(strings)
<class 'arkouda.strings.Strings'>
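The maxTransferBytes guard described in the notes can be checked on the client before calling ak.array. This NumPy-only sketch assumes a 1 GiB limit purely for illustration; the actual limit lives in ak.client.maxTransferBytes:

```python
import numpy as np

# Pre-flight size check mirroring arkouda's client-side transfer guard.
# The 1 GiB value below is an illustrative assumption, not arkouda's default.
MAX_TRANSFER_BYTES = 2**30

a = np.arange(1_000_000, dtype=np.int64)
print(a.nbytes)  # 8000000 bytes of int64 data
ok_to_send = a.nbytes <= MAX_TRANSFER_BYTES
print(ok_to_send)  # True
```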
- arkouda.zeros(size: arkouda.dtypes.int_scalars | str, dtype: numpy.dtype | type | str | arkouda.dtypes.BigInt = float64, max_bits: int | None = None) arkouda.pdarrayclass.pdarray#
Create a pdarray filled with zeros.
- Parameters:
size (int_scalars) – Size of the array (only rank-1 arrays supported)
dtype (all_scalars) – Type of resulting array, default float64
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
Zeros of the requested size and dtype
- Return type:
- Raises:
TypeError – Raised if the supplied dtype is not supported or if the size parameter is neither an int nor a str that is parseable to an int.
See also
Examples
>>> ak.zeros(5, dtype=ak.int64)
array([0, 0, 0, 0, 0])

>>> ak.zeros(5, dtype=ak.float64)
array([0, 0, 0, 0, 0])

>>> ak.zeros(5, dtype=ak.bool)
array([False, False, False, False, False])
- arkouda.ones(size: arkouda.dtypes.int_scalars | str, dtype: numpy.dtype | type | str | arkouda.dtypes.BigInt = float64, max_bits: int | None = None) arkouda.pdarrayclass.pdarray#
Create a pdarray filled with ones.
- Parameters:
size (int_scalars) – Size of the array (only rank-1 arrays supported)
dtype (Union[float64, int64, bool]) – Resulting array type, default float64
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
Ones of the requested size and dtype
- Return type:
- Raises:
TypeError – Raised if the supplied dtype is not supported or if the size parameter is neither an int nor a str that is parseable to an int.
Examples
>>> ak.ones(5, dtype=ak.int64)
array([1, 1, 1, 1, 1])

>>> ak.ones(5, dtype=ak.float64)
array([1, 1, 1, 1, 1])

>>> ak.ones(5, dtype=ak.bool)
array([True, True, True, True, True])
- arkouda.full(size: arkouda.dtypes.int_scalars | str, fill_value: arkouda.dtypes.numeric_scalars | str, dtype: numpy.dtype | type | str | arkouda.dtypes.BigInt = float64, max_bits: int | None = None) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings#
Create a pdarray filled with fill_value.
- Parameters:
size (int_scalars) – Size of the array (only rank-1 arrays supported)
fill_value (int_scalars) – Value with which the array will be filled
dtype (all_scalars) – Resulting array type, default float64
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
array of the requested size and dtype filled with fill_value
- Return type:
- Raises:
TypeError – Raised if the supplied dtype is not supported or if the size parameter is neither an int nor a str that is parseable to an int.
Examples
>>> ak.full(5, 7, dtype=ak.int64)
array([7, 7, 7, 7, 7])

>>> ak.full(5, 9, dtype=ak.float64)
array([9, 9, 9, 9, 9])

>>> ak.full(5, 5, dtype=ak.bool)
array([True, True, True, True, True])
- arkouda.zeros_like(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Create a zero-filled pdarray of the same size and dtype as an existing pdarray.
- Parameters:
pda (pdarray) – Array to use for size and dtype
- Returns:
Equivalent to ak.zeros(pda.size, pda.dtype)
- Return type:
- Raises:
TypeError – Raised if the pda parameter is not a pdarray.
Examples
>>> zeros = ak.zeros(5, dtype=ak.int64)
>>> ak.zeros_like(zeros)
array([0, 0, 0, 0, 0])

>>> zeros = ak.zeros(5, dtype=ak.float64)
>>> ak.zeros_like(zeros)
array([0, 0, 0, 0, 0])

>>> zeros = ak.zeros(5, dtype=ak.bool)
>>> ak.zeros_like(zeros)
array([False, False, False, False, False])
- arkouda.ones_like(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Create a one-filled pdarray of the same size and dtype as an existing pdarray.
- Parameters:
pda (pdarray) – Array to use for size and dtype
- Returns:
Equivalent to ak.ones(pda.size, pda.dtype)
- Return type:
- Raises:
TypeError – Raised if the pda parameter is not a pdarray.
See also
Notes
Logic for generating the pdarray is delegated to the ak.ones method. Accordingly, the supported dtypes are those defined by the ak.ones method.
Examples
>>> ones = ak.ones(5, dtype=ak.int64)
>>> ak.ones_like(ones)
array([1, 1, 1, 1, 1])

>>> ones = ak.ones(5, dtype=ak.float64)
>>> ak.ones_like(ones)
array([1, 1, 1, 1, 1])

>>> ones = ak.ones(5, dtype=ak.bool)
>>> ak.ones_like(ones)
array([True, True, True, True, True])
- arkouda.full_like(pda: arkouda.pdarrayclass.pdarray, fill_value: arkouda.dtypes.numeric_scalars) arkouda.pdarrayclass.pdarray#
Create a pdarray filled with fill_value of the same size and dtype as an existing pdarray.
- Parameters:
pda (pdarray) – Array to use for size and dtype
fill_value (int_scalars) – Value with which the array will be filled
- Returns:
Equivalent to ak.full(pda.size, fill_value, pda.dtype)
- Return type:
- Raises:
TypeError – Raised if the pda parameter is not a pdarray.
See also
Notes
Logic for generating the pdarray is delegated to the ak.full method. Accordingly, the supported dtypes are those defined by the ak.full method.
Examples
>>> full = ak.full(5, 7, dtype=ak.int64)
>>> ak.full_like(full, 7)
array([7, 7, 7, 7, 7])

>>> full = ak.full(5, 9, dtype=ak.float64)
>>> ak.full_like(full, 9)
array([9, 9, 9, 9, 9])

>>> full = ak.full(5, 5, dtype=ak.bool)
>>> ak.full_like(full, 5)
array([True, True, True, True, True])
- arkouda.arange(*args, **kwargs) arkouda.pdarrayclass.pdarray#
arange([start,] stop[, stride,] dtype=int64)
Create a pdarray of consecutive integers within the interval [start, stop). If only one arg is given then arg is the stop parameter. If two args are given, then the first arg is start and second is stop. If three args are given, then the first arg is start, second is stop, third is stride.
The return value is cast to type dtype
- Parameters:
start (int_scalars, optional) – Starting value (inclusive)
stop (int_scalars) – Stopping value (exclusive)
stride (int_scalars, optional) – The difference between consecutive elements; defaults to 1. If stride is specified, then start must also be specified.
dtype (np.dtype, type, or str) – The target dtype to cast values to
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
Integers from start (inclusive) to stop (exclusive) by stride
- Return type:
pdarray, dtype
- Raises:
TypeError – Raised if start, stop, or stride is not an int object
ZeroDivisionError – Raised if stride == 0
Notes
Negative strides result in decreasing values. Currently, only int64 pdarrays can be created with this method. For float64 arrays, use the linspace method.
Examples
>>> ak.arange(0, 5, 1)
array([0, 1, 2, 3, 4])

>>> ak.arange(5, 0, -1)
array([5, 4, 3, 2, 1])

>>> ak.arange(0, 10, 2)
array([0, 2, 4, 6, 8])

>>> ak.arange(-5, -10, -1)
array([-5, -6, -7, -8, -9])
- arkouda.linspace(start: arkouda.dtypes.numeric_scalars, stop: arkouda.dtypes.numeric_scalars, length: arkouda.dtypes.int_scalars) arkouda.pdarrayclass.pdarray#
Create a pdarray of linearly-spaced floats in a closed interval.
- Parameters:
start (numeric_scalars) – Start of interval (inclusive)
stop (numeric_scalars) – End of interval (inclusive)
length (int_scalars) – Number of points
- Returns:
Array of evenly spaced float values along the interval
- Return type:
pdarray, float64
- Raises:
TypeError – Raised if start or stop is not a float or int or if length is not an int
See also
Notes
If start is greater than stop, the pdarray values are generated in descending order.
Examples
>>> ak.linspace(0, 1, 5)
array([0, 0.25, 0.5, 0.75, 1])

>>> ak.linspace(start=1, stop=0, length=5)
array([1, 0.75, 0.5, 0.25, 0])

>>> ak.linspace(start=-5, stop=0, length=5)
array([-5, -3.75, -2.5, -1.25, 0])
- arkouda.randint(low: arkouda.dtypes.numeric_scalars, high: arkouda.dtypes.numeric_scalars, size: arkouda.dtypes.int_scalars, dtype=akint64, seed: arkouda.dtypes.int_scalars = None) arkouda.pdarrayclass.pdarray#
Generate a pdarray of randomized int, float, or bool values in a specified range bounded by the low and high parameters.
- Parameters:
low (numeric_scalars) – The low value (inclusive) of the range
high (numeric_scalars) – The high value (exclusive for int, inclusive for float) of the range
size (int_scalars) – The length of the returned array
dtype (Union[int64, float64, bool]) – The dtype of the array
seed (int_scalars) – Value used to initialize the random number generator
- Returns:
Values drawn uniformly from the specified range having the desired dtype
- Return type:
- Raises:
TypeError – Raised if dtype.name not in DTypes, size is not an int, low or high is not an int or float, or seed is not an int
ValueError – Raised if size < 0 or if high < low
Notes
Calling randint with dtype=float64 will result in uniform non-integral floating point values.
Ranges of size >= 2**64 are undefined behavior because they exceed the maximum value that can be stored on the server (uint64)
Examples
>>> ak.randint(0, 10, 5)
array([5, 7, 4, 8, 3])

>>> ak.randint(0, 1, 3, dtype=ak.float64)
array([0.92176432277231968, 0.083130710959903542, 0.68894208386667544])

>>> ak.randint(0, 1, 5, dtype=ak.bool)
array([True, False, True, True, True])

>>> ak.randint(1, 5, 10, seed=2)
array([4, 3, 1, 3, 4, 4, 2, 4, 3, 2])

>>> ak.randint(1, 5, 3, dtype=ak.float64, seed=2)
array([2.9160772326374946, 4.353429832157099, 4.5392023718621486])

>>> ak.randint(1, 5, 10, dtype=ak.bool, seed=2)
array([False, True, True, True, True, False, True, True, True, True])
- arkouda.uniform(size: arkouda.dtypes.int_scalars, low: arkouda.dtypes.numeric_scalars = float(0.0), high: arkouda.dtypes.numeric_scalars = 1.0, seed: None | arkouda.dtypes.int_scalars = None) arkouda.pdarrayclass.pdarray#
Generate a pdarray with uniformly distributed random float values in a specified range.
- Parameters:
low (float_scalars) – The low value (inclusive) of the range, defaults to 0.0
high (float_scalars) – The high value (inclusive) of the range, defaults to 1.0
size (int_scalars) – The length of the returned array
seed (int_scalars, optional) – Value used to initialize the random number generator
- Returns:
Values drawn uniformly from the specified range
- Return type:
pdarray, float64
- Raises:
TypeError – Raised if dtype.name not in DTypes, size is not an int, or if either low or high is not an int or float
ValueError – Raised if size < 0 or if high < low
Notes
The logic for uniform is delegated to the ak.randint method which is invoked with a dtype of float64
Examples
>>> ak.uniform(3)
array([0.92176432277231968, 0.083130710959903542, 0.68894208386667544])

>>> ak.uniform(size=3,low=0,high=5,seed=0)
array([0.30013431967121934, 0.47383036230759112, 1.0441791878997098])
- arkouda.standard_normal(size: arkouda.dtypes.int_scalars, seed: None | arkouda.dtypes.int_scalars = None) arkouda.pdarrayclass.pdarray#
Draw real numbers from the standard normal distribution.
- Parameters:
size (int_scalars) – The number of samples to draw (size of the returned array)
seed (int_scalars) – Value used to initialize the random number generator
- Returns:
The array of random numbers
- Return type:
pdarray, float64
- Raises:
TypeError – Raised if size is not an int
ValueError – Raised if size < 0
See also
Notes
For random samples from \(N(\mu, \sigma^2)\), use:
(sigma * standard_normal(size)) + mu

Examples

>>> ak.standard_normal(3,1)
array([-0.68586185091150265, 1.1723810583573375, 0.567584107142031])
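The N(mu, sigma^2) scaling recipe from the note can be sketched with NumPy's equivalent generator (no arkouda server required; the seed, sample size, and tolerances are illustrative choices):

```python
import numpy as np

# Draw standard-normal samples, then shift and scale to N(mu, sigma^2),
# mirroring the (sigma * standard_normal(size)) + mu recipe in the note.
mu, sigma = 5.0, 2.0
rng = np.random.default_rng(1)
samples = sigma * rng.standard_normal(100_000) + mu
print(samples.mean(), samples.std())  # close to 5.0 and 2.0
```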
- arkouda.random_strings_uniform(minlen: arkouda.dtypes.int_scalars, maxlen: arkouda.dtypes.int_scalars, size: arkouda.dtypes.int_scalars, characters: str = 'uppercase', seed: None | arkouda.dtypes.int_scalars = None) arkouda.strings.Strings#
Generate random strings with lengths uniformly distributed between minlen and maxlen, and with characters drawn from a specified set.
- Parameters:
minlen (int_scalars) – The minimum allowed length of string
maxlen (int_scalars) – The maximum allowed length of string
size (int_scalars) – The number of strings to generate
characters ((uppercase, lowercase, numeric, printable, binary)) – The set of characters to draw from
seed (Union[None, int_scalars], optional) – Value used to initialize the random number generator
- Returns:
The array of random strings
- Return type:
- Raises:
ValueError – Raised if minlen < 0, maxlen < minlen, or size < 0
See also
Examples
>>> ak.random_strings_uniform(minlen=1, maxlen=5, seed=1, size=5)
array(['TVKJ', 'EWAB', 'CO', 'HFMD', 'U'])

>>> ak.random_strings_uniform(minlen=1, maxlen=5, seed=1, size=5,
...                           characters='printable')
array(['+5"f', '-P]3', '4k', '~HFF', 'F'])
- arkouda.random_strings_lognormal(logmean: arkouda.dtypes.numeric_scalars, logstd: arkouda.dtypes.numeric_scalars, size: arkouda.dtypes.int_scalars, characters: str = 'uppercase', seed: arkouda.dtypes.int_scalars | None = None) arkouda.strings.Strings#
Generate random strings with log-normally distributed lengths and with characters drawn from a specified set.
- Parameters:
logmean (numeric_scalars) – The log-mean of the length distribution
logstd (numeric_scalars) – The log-standard-deviation of the length distribution
size (int_scalars) – The number of strings to generate
characters ((uppercase, lowercase, numeric, printable, binary)) – The set of characters to draw from
seed (int_scalars, optional) – Value used to initialize the random number generator
- Returns:
The Strings object encapsulating a pdarray of random strings
- Return type:
- Raises:
TypeError – Raised if logmean is neither a float nor an int, logstd is not a float, size is not an int, or if characters is not a str
ValueError – Raised if logstd <= 0 or size < 0
See also
Notes
The lengths of the generated strings are distributed \(Lognormal(\mu, \sigma^2)\), with \(\mu = logmean\) and \(\sigma = logstd\). Thus, the strings will have an average length of \(exp(\mu + 0.5*\sigma^2)\), a minimum length of zero, and a heavy tail towards longer strings.
Examples
>>> ak.random_strings_lognormal(2, 0.25, 5, seed=1)
array(['TVKJTE', 'ABOCORHFM', 'LUDMMGTB', 'KWOQNPHZ', 'VSXRRL'])

>>> ak.random_strings_lognormal(2, 0.25, 5, seed=1, characters='printable')
array(['+5"fp-', ']3Q4kC~HF', '=F=`,IE!', 'DjkBa'9(', '5oZ1)='])
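The expected average length given in the notes can be evaluated directly for the documented example parameters (logmean=2, logstd=0.25); this is a plain arithmetic check, not an arkouda call:

```python
import math

# Average string length exp(mu + 0.5 * sigma**2) from the lognormal note,
# evaluated for the example parameters logmean=2, logstd=0.25.
logmean, logstd = 2.0, 0.25
avg_len = math.exp(logmean + 0.5 * logstd**2)
print(avg_len)  # about 7.62 characters on average
```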
- arkouda.from_series(series: pandas.Series, dtype: type | str | None = None) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings#
Converts a Pandas Series to an Arkouda pdarray or Strings object. If dtype is None, the dtype is inferred from the Pandas Series. Otherwise, the dtype parameter is set if the dtype of the Pandas Series is to be overridden or is unknown (for example, in situations where the Series dtype is object).
- Parameters:
series (Pandas Series) – The Pandas Series with a dtype of bool, float64, int64, or string
dtype (Optional[type]) – The valid dtype types are np.bool, np.float64, np.int64, and np.str
- Return type:
- Raises:
TypeError – Raised if series is not a Pandas Series object
ValueError – Raised if the Series dtype is not bool, float64, int64, string, datetime, or timedelta
Examples
>>> ak.from_series(pd.Series(np.random.randint(0,10,5)))
array([9, 0, 4, 7, 9])

>>> ak.from_series(pd.Series(['1', '2', '3', '4', '5']), dtype=np.int64)
array([1, 2, 3, 4, 5])

>>> ak.from_series(pd.Series(np.random.uniform(low=0.0,high=1.0,size=3)))
array([0.57600036956445599, 0.41619265571741659, 0.6615356693784662])

>>> ak.from_series(pd.Series(['0.57600036956445599', '0.41619265571741659',
...                           '0.6615356693784662']), dtype=np.float64)
array([0.57600036956445599, 0.41619265571741659, 0.6615356693784662])

>>> ak.from_series(pd.Series(np.random.choice([True, False],size=5)))
array([True, False, True, True, True])

>>> ak.from_series(pd.Series(['True', 'False', 'False', 'True', 'True']), dtype=np.bool)
array([True, True, True, True, True])

>>> ak.from_series(pd.Series(['a', 'b', 'c', 'd', 'e'], dtype="string"))
array(['a', 'b', 'c', 'd', 'e'])

>>> ak.from_series(pd.Series(['a', 'b', 'c', 'd', 'e']), dtype=np.str)
array(['a', 'b', 'c', 'd', 'e'])

>>> ak.from_series(pd.Series(pd.to_datetime(['1/1/2018', np.datetime64('2018-01-01')])))
array([1514764800000000000, 1514764800000000000])
Notes
The supported datatypes are bool, float64, int64, string, and datetime64[ns]. The data type is either inferred from the Series or is set via the dtype parameter.
Series of datetime or timedelta are converted to Arkouda arrays of dtype int64 (nanoseconds)
A Pandas Series containing strings has a dtype of object. Arkouda assumes the Series contains strings and sets the dtype to str
- arkouda.bigint_from_uint_arrays(arrays, max_bits=-1)#
Create a bigint pdarray from an iterable of uint pdarrays. The first item in arrays will be the highest 64 bits and the last item will be the lowest 64 bits.
- Parameters:
arrays (Sequence[pdarray]) – An iterable of uint pdarrays used to construct the bigint pdarray. The first item in arrays will be the highest 64 bits and the last item will be the lowest 64 bits.
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
bigint pdarray constructed from uint arrays
- Return type:
- Raises:
TypeError – Raised if any pdarray in arrays has a dtype other than uint or if the pdarrays are not the same size.
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> a = ak.bigint_from_uint_arrays([ak.ones(5, dtype=ak.uint64), ak.arange(5, dtype=ak.uint64)])
>>> a
array(["18446744073709551616" "18446744073709551617" "18446744073709551618" "18446744073709551619" "18446744073709551620"])

>>> a.dtype
dtype(bigint)

>>> all(a[i] == 2**64 + i for i in range(5))
True
- arkouda.cast(pda: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical, dt: numpy.dtype | type | str | arkouda.dtypes.BigInt, errors: ErrorMode = ErrorMode.strict) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical | Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Cast an array to another dtype.
- Parameters:
dt (np.dtype, type, or str) – The target dtype to cast values to
errors ({strict, ignore, return_validity}) –
Controls how errors are handled when casting strings to a numeric type (ignored for casts from numeric types).
- strict: raise RuntimeError if any string cannot be converted
- ignore: never raise an error; uninterpretable strings are converted to NaN (float64), -2**63 (int64), zero (uint64 and uint8), or False (bool)
- return_validity: in addition to returning the same output as "ignore", also return a bool array indicating where the cast was successful.
- Returns:
pdarray or Strings – Array of values cast to desired dtype
[validity (pdarray(bool)]) – If errors=”return_validity” and input is Strings, a second array is returned with True where the cast succeeded and False where it failed.
Notes
The cast is performed according to Chapel’s casting rules and is NOT safe from overflows or underflows. The user must ensure that the target dtype has the precision and capacity to hold the desired result.
Examples
>>> ak.cast(ak.linspace(1.0,5.0,5), dt=ak.int64)
array([1, 2, 3, 4, 5])

>>> ak.cast(ak.arange(0,5), dt=ak.float64).dtype
dtype('float64')

>>> ak.cast(ak.arange(0,5), dt=ak.bool)
array([False, True, True, True, True])

>>> ak.cast(ak.linspace(0,4,5), dt=ak.bool)
array([False, True, True, True, True])
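As the note warns, the cast is not safe from overflow or underflow. A NumPy-only analogue (NumPy's casting rules, not Chapel's, so the behavior is merely similar in spirit) shows the kind of truncation and silent wraparound to watch for:

```python
import numpy as np

# Float-to-int casts truncate toward zero rather than round.
truncated = np.array([3.9, -3.9]).astype(np.int64)
print(truncated)  # [ 3 -3]

# int64 array arithmetic silently wraps on overflow:
# 2**62 * 4 == 2**64, which is 0 modulo 2**64.
wrapped = np.array([2**62], dtype=np.int64) * 4
print(wrapped)  # [0]
```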
- arkouda.abs(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise absolute value of the array.
- Parameters:
pda (pdarray) –
- Returns:
A pdarray containing absolute values of the input array elements
- Return type:
- Raises:
TypeError – Raised if the parameter is not a pdarray
Examples
>>> ak.abs(ak.arange(-5,-1))
array([5, 4, 3, 2])

>>> ak.abs(ak.linspace(-5,-1,5))
array([5, 4, 3, 2, 1])
- arkouda.log(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise natural log of the array.
- Parameters:
pda (pdarray) –
- Returns:
A pdarray containing natural log values of the input array elements
- Return type:
- Raises:
TypeError – Raised if the parameter is not a pdarray
Notes
Logarithms with other bases can be computed by dividing the natural log by the natural log of the desired base, as shown in the examples below.
Examples
>>> A = ak.array([1, 10, 100])
>>> ak.log(A)  # natural log
array([0, 2.3025850929940459, 4.6051701859880918])
>>> ak.log(A) / np.log(10)  # log base 10
array([0, 1, 2])
>>> ak.log(A) / np.log(2)  # log base 2
array([0, 3.3219280948873626, 6.6438561897747253])
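The change-of-base identity used here, log_b(x) = ln(x) / ln(b), can be verified with NumPy alone:

```python
import numpy as np

# Recover log10 and log2 from the natural log via the change-of-base identity.
A = np.array([1.0, 10.0, 100.0])
log10 = np.log(A) / np.log(10)
log2 = np.log(A) / np.log(2)
print(log10)  # [0. 1. 2.]
```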
- arkouda.exp(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise exponential of the array.
- Parameters:
pda (pdarray) –
- Returns:
A pdarray containing exponential values of the input array elements
- Return type:
- Raises:
TypeError – Raised if the parameter is not a pdarray
Examples
>>> ak.exp(ak.arange(1,5))
array([2.7182818284590451, 7.3890560989306504, 20.085536923187668, 54.598150033144236])

>>> ak.exp(ak.uniform(5,1.0,5.0))
array([11.84010843172504, 46.454368507659211, 5.5571769623557188, 33.494295836924771, 13.478894913238722])
- arkouda.cumsum(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the cumulative sum over the array.
The sum is inclusive, such that the ith element of the result is the sum of elements up to and including i.

- Parameters:
pda (pdarray) –
- Returns:
A pdarray containing cumulative sums for each element of the original pdarray
- Return type:
- Raises:
TypeError – Raised if the parameter is not a pdarray
Examples
>>> ak.cumsum(ak.arange(1, 4))
array([1, 3, 6])
>>> ak.cumsum(ak.uniform(5,1.0,5.0))
array([3.1598310770203937, 5.4110385860243131, 9.1622479306453748, 12.710615785506533, 13.945880905466208])

>>> ak.cumsum(ak.randint(0, 1, 5, dtype=ak.bool))
array([0, 1, 1, 2, 3])
- arkouda.cumprod(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the cumulative product over the array.
The product is inclusive, such that the ith element of the result is the product of elements up to and including i.

- Parameters:
pda (pdarray) –
- Returns:
A pdarray containing cumulative products for each element of the original pdarray
- Return type:
- Raises:
TypeError – Raised if the parameter is not a pdarray
Examples
>>> ak.cumprod(ak.arange(1,5))
array([1, 2, 6, 24])
>>> ak.cumprod(ak.uniform(5,1.0,5.0))
array([1.5728783400481925, 7.0472855509390593, 33.78523998586553, 134.05309592737584, 450.21589865655358])
- arkouda.sin(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise sine of the array.
- arkouda.cos(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise cosine of the array.
- arkouda.tan(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise tangent of the array.
- arkouda.arcsin(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise inverse sine of the array. The result is between -pi/2 and pi/2.
- arkouda.arccos(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise inverse cosine of the array. The result is between 0 and pi.
- arkouda.arctan(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise inverse tangent of the array. The result is between -pi/2 and pi/2.
- arkouda.arctan2(num: arkouda.pdarrayclass.pdarray | arkouda.dtypes.numeric_scalars, denom: arkouda.pdarrayclass.pdarray | arkouda.dtypes.numeric_scalars) arkouda.pdarrayclass.pdarray#
Return the element-wise inverse tangent of the array pair. The result chosen is the signed angle in radians between the ray ending at the origin and passing through the point (1,0), and the ray ending at the origin and passing through the point (denom, num). The result is between -pi and pi.
- Parameters:
- Returns:
A pdarray containing the inverse tangent for each corresponding element pair of the original pdarrays, using the signed values of the numerator and denominator to get proper placement on the unit circle.
- Return type:
- Raises:
TypeError – Raised if the parameter is not a pdarray
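Because arctan2 uses the signs of both arguments, it can distinguish quadrants that a plain arctan of the ratio cannot. NumPy's arctan2 follows the same convention, so the behavior can be illustrated without a server:

```python
import numpy as np

# atan(1 / -1) alone would give -pi/4, but the point (denom, num) = (-1, 1)
# lies in the second quadrant, so arctan2 returns 3*pi/4 instead.
angle = np.arctan2(1.0, -1.0)  # num=1, denom=-1
print(angle)  # 2.356..., i.e. 3*pi/4
```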
- arkouda.sinh(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise hyperbolic sine of the array.
- arkouda.cosh(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise hyperbolic cosine of the array.
- arkouda.tanh(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise hyperbolic tangent of the array.
- arkouda.arcsinh(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise inverse hyperbolic sine of the array.
- arkouda.arccosh(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise inverse hyperbolic cosine of the array.
- arkouda.arctanh(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise inverse hyperbolic tangent of the array.
- arkouda.rad2deg(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Converts angles element-wise from radians to degrees.
- arkouda.deg2rad(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Converts angles element-wise from degrees to radians.
- arkouda.hash(pda: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | SegArray | Categorical | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | SegArray | Categorical], full: bool = True) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray] | arkouda.pdarrayclass.pdarray#
Return an element-wise hash of the array or list of arrays.
- Parameters:
pda (Union[pdarray, Strings, SegArray, Categorical], List[Union[pdarray, Strings, SegArray, Categorical]]) – The array or list of arrays to hash
full (bool) – Only used when a single pdarray is passed into hash. By default, a 128-bit hash is computed and returned as two int64 arrays. If full=False, then a 64-bit hash is computed and returned as a single int64 array.
- Returns:
If full=True or a list of pdarrays is passed, a 2-tuple of pdarrays containing the high and low 64 bits of each hash, respectively. If full=False and a single pdarray is passed, a single pdarray containing a 64-bit hash
- Return type:
hashes
- Raises:
TypeError – Raised if the parameter is not a pdarray
Notes
In the case of a single pdarray being passed, this function uses the SIPhash algorithm, which can output either a 64-bit or 128-bit hash. However, the 64-bit hash runs a significant risk of collisions when applied to more than a few million unique values. Unless the number of unique values is known to be small, the 128-bit hash is strongly recommended.
Note that this hash should not be used for security, or for any cryptographic application. Not only is SIPhash not intended for such uses, but this implementation employs a fixed key for the hash, which makes it possible for an adversary with control over input to engineer collisions.
In the case of a list of pdarrays, Strings, Categoricals, or SegArrays being passed, a non-linear function must be applied to each array, since hashes of subsequent arrays cannot simply be XORed together (equivalent values would cancel each other out); hence each hash is rotated by the ordinal of its array.
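The collision warning for the 64-bit hash can be quantified with the standard birthday bound: the expected number of colliding pairs among n uniform b-bit hashes is roughly n*(n-1)/2 / 2**b. This is a back-of-the-envelope estimate, not part of arkouda's API:

```python
# Birthday-bound estimate: expected number of colliding pairs
# among n uniformly distributed b-bit hash values.
def expected_collisions(n: int, bits: int) -> float:
    return n * (n - 1) / 2 / 2**bits

# Around n = 2**32 values, the 64-bit hash expects its first collision...
print(expected_collisions(2**32, 64))   # roughly 0.5
# ...while the 128-bit hash remains effectively collision-free.
print(expected_collisions(2**32, 128))  # roughly 2.7e-20
```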
- arkouda.where(condition: arkouda.pdarrayclass.pdarray, A: str | arkouda.dtypes.numeric_scalars | arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical, B: str | arkouda.dtypes.numeric_scalars | arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical#
Returns an array with elements chosen from A and B based upon a conditioning array. As is the case with numpy.where, the return array consists of values from the first array (A) where the conditioning array elements are True and from the second array (B) where the conditioning array elements are False.
- Parameters:
condition (pdarray) – Used to choose values from A or B
A (Union[numeric_scalars, str, pdarray, Strings, Categorical]) – Value(s) used when condition is True
B (Union[numeric_scalars, str, pdarray, Strings, Categorical]) – Value(s) used when condition is False
- Returns:
Values chosen from A where the condition is True and B where the condition is False
- Return type:
- Raises:
TypeError – Raised if the condition object is not a pdarray; if A or B is not an int, np.int64, float, np.float64, str, pdarray, Strings, or Categorical; if pdarray dtypes are not supported or do not match; or if multiple condition clauses (see Notes section) are applied
ValueError – Raised if the shapes of the condition, A, and B pdarrays are unequal
Examples
>>> a1 = ak.arange(1,10)
>>> a2 = ak.ones(9, dtype=np.int64)
>>> cond = a1 < 5
>>> ak.where(cond,a1,a2)
array([1, 2, 3, 4, 1, 1, 1, 1, 1])

>>> a1 = ak.arange(1,10)
>>> a2 = ak.ones(9, dtype=np.int64)
>>> cond = a1 == 5
>>> ak.where(cond,a1,a2)
array([1, 1, 1, 1, 5, 1, 1, 1, 1])

>>> a1 = ak.arange(1,10)
>>> a2 = 10
>>> cond = a1 < 5
>>> ak.where(cond,a1,a2)
array([1, 2, 3, 4, 10, 10, 10, 10, 10])

>>> s1 = ak.array([f'str {i}' for i in range(10)])
>>> s2 = 'str 21'
>>> cond = (ak.arange(10) % 2 == 0)
>>> ak.where(cond,s1,s2)
array(['str 0', 'str 21', 'str 2', 'str 21', 'str 4', 'str 21', 'str 6', 'str 21', 'str 8', 'str 21'])

>>> c1 = ak.Categorical(ak.array([f'str {i}' for i in range(10)]))
>>> c2 = ak.Categorical(ak.array([f'str {i}' for i in range(9, -1, -1)]))
>>> cond = (ak.arange(10) % 2 == 0)
>>> ak.where(cond,c1,c2)
array(['str 0', 'str 8', 'str 2', 'str 6', 'str 4', 'str 4', 'str 6', 'str 2', 'str 8', 'str 0'])
Notes
A and B must have the same dtype, and only a single conditional clause is supported (e.g., n < 5 or n > 1). Compound conditions such as 1 < n < 5, which numpy supports, are not currently supported in Arkouda.
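For the single-clause case, ak.where follows numpy.where semantics; the first example above corresponds to this NumPy snippet:

```python
import numpy as np

# Pick from a1 where the condition holds, from a2 elsewhere,
# mirroring the first ak.where example.
a1 = np.arange(1, 10)
a2 = np.ones(9, dtype=np.int64)
result = np.where(a1 < 5, a1, a2)
print(result)  # [1 2 3 4 1 1 1 1 1]
```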
- arkouda.histogram(pda: arkouda.pdarrayclass.pdarray, bins: arkouda.dtypes.int_scalars = 10) Tuple[numpy.ndarray, arkouda.pdarrayclass.pdarray]#
Compute a histogram of evenly spaced bins over the range of an array.
- Parameters:
pda (pdarray) – The values to histogram
bins (int_scalars) – The number of equal-size bins to use (default: 10)
- Returns:
The bin edges and the number of values present in each bin
- Return type:
(np.ndarray, Union[pdarray, int64 or float64])
- Raises:
TypeError – Raised if the parameter is not a pdarray or if bins is not an int.
ValueError – Raised if bins < 1
NotImplementedError – Raised if pdarray dtype is bool or uint8
See also
Notes
The bins are evenly spaced in the interval [pda.min(), pda.max()].
Examples
>>> import matplotlib.pyplot as plt
>>> A = ak.arange(0, 10, 1)
>>> nbins = 3
>>> b, h = ak.histogram(A, bins=nbins)
>>> h
array([3, 3, 4])
>>> b
array([0., 3., 6.])

To plot, use only the left edges (now returned), and export the histogram to NumPy:

>>> plt.plot(b, h.to_ndarray())
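The evenly spaced binning matches numpy.histogram over the same data (note that NumPy returns all bin edges, whereas the ak.histogram example above shows only the left edges):

```python
import numpy as np

# Same 10-element example: 3 equal-width bins over [0, 9].
counts, edges = np.histogram(np.arange(10), bins=3)
print(counts)  # [3 3 4]
print(edges)   # [0. 3. 6. 9.]
```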
- arkouda.value_counts(pda: arkouda.pdarrayclass.pdarray) Categorical | Tuple[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings, arkouda.pdarrayclass.pdarray | None]#
Count the occurrences of the unique values of an array.
- Parameters:
pda (pdarray, int64) – The array of values to count
- Returns:
unique_values (pdarray, int64 or Strings) – The unique values, sorted in ascending order
counts (pdarray, int64) – The number of times the corresponding unique value occurs
- Raises:
TypeError – Raised if the parameter is not a pdarray
Notes
This function differs from histogram() in that it only returns counts for values that are present, leaving out empty “bins”. This function delegates all logic to the unique() method where the return_counts parameter is set to True.

Examples

>>> A = ak.array([2, 0, 2, 4, 0, 0])
>>> ak.value_counts(A)
(array([0, 2, 4]), array([3, 2, 1]))
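The same counts-of-present-values behavior is available in NumPy via np.unique with return_counts=True, which makes the example easy to verify locally:

```python
import numpy as np

# Unique values and their occurrence counts; empty "bins" never appear.
values, counts = np.unique(np.array([2, 0, 2, 4, 0, 0]), return_counts=True)
print(values)  # [0 2 4]
print(counts)  # [3 2 1]
```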
- arkouda.isnan(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Test a pdarray for Not-a-Number (NaN) values. Currently only supports float-value-based arrays.
- Parameters:
pda (pdarray to test) –
- Return type:
pdarray consisting of True / False values; True where NaN, False otherwise
- Raises:
TypeError – Raised if the parameter is not a pdarray
RuntimeError – if the underlying pdarray is not float-based
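The elementwise semantics can be sketched with a pure-Python analogue (illustrative only; arkouda performs this test server-side on a pdarray):

```python
import math

def isnan_analogue(values):
    """Elementwise NaN test over a sequence of floats."""
    return [math.isnan(v) for v in values]

print(isnan_analogue([1.0, float("nan"), 3.5]))  # [False, True, False]
```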
- class arkouda.ErrorMode#
Bases: enum.Enum
Generic enumeration.
Derive from this class to define new enumerations.
- strict = 'strict'#
- ignore = 'ignore'#
- return_validity = 'return_validity'#
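The three members above can be mirrored with a standard-library Enum; this stand-alone sketch has the same shape as the class documented here, but is not arkouda's class itself:

```python
from enum import Enum

class ErrorMode(Enum):
    strict = 'strict'
    ignore = 'ignore'
    return_validity = 'return_validity'

# Members can be looked up by their string value, e.g. when
# parsing a user-supplied mode name
mode = ErrorMode('strict')
print(mode is ErrorMode.strict)  # True
```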
- arkouda.unique(pda: groupable, return_groups: bool = False, assume_sorted: bool = False, return_indices: bool = False) groupable | Tuple[groupable, arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray, int]#
Find the unique elements of an array.
Returns the unique elements of an array, sorted if the values are integers. There is an optional output in addition to the unique elements: the number of times each unique value comes up in the input array.
- Parameters:
pda ((list of) pdarray, Strings, or Categorical) – Input array.
return_groups (bool, optional) – If True, also return grouping information for the array.
return_indices (bool, optional) – Only applicable if return_groups is True. If True, return unique key indices along with other groups
assume_sorted (bool, optional) – If True, assume pda is sorted and skip sorting step
- Returns:
unique ((list of) pdarray, Strings, or Categorical) – The unique values. If input dtype is int64, return values will be sorted.
permutation (pdarray, optional) – Permutation that groups equivalent values together (only when return_groups=True)
segments (pdarray, optional) – The offset of each group in the permuted array (only when return_groups=True)
- Raises:
TypeError – Raised if pda is not a pdarray or Strings object
RuntimeError – Raised if the pdarray or Strings dtype is unsupported
Notes
For integer arrays, this function checks to see whether pda is sorted and, if so, whether it is already unique. This step can save considerable computation. Otherwise, this function will sort pda.
Examples
>>> A = ak.array([3, 2, 1, 1, 2, 3])
>>> ak.unique(A)
array([1, 2, 3])
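What return_groups=True adds can be sketched in plain Python (an analogue of the documented semantics, not the server-side algorithm): the permutation gathers equal values together, and segments gives the offset where each group starts in the permuted order.

```python
def unique_with_groups(a):
    """Return (unique values, grouping permutation, group segment offsets)."""
    # Stable argsort: a permutation that groups equal values together
    perm = sorted(range(len(a)), key=lambda i: a[i])
    grouped = [a[i] for i in perm]
    # Segment offsets: positions where a new value begins in grouped order
    segments = [i for i in range(len(grouped))
                if i == 0 or grouped[i] != grouped[i - 1]]
    uniques = [grouped[s] for s in segments]
    return uniques, perm, segments

print(unique_with_groups([3, 2, 1, 1, 2, 3]))
# ([1, 2, 3], [2, 3, 1, 4, 0, 5], [0, 2, 4])
```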
- class arkouda.GroupBy(keys: groupable | None = None, assume_sorted: bool = False, **kwargs)#
Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.
- Parameters:
keys ((list of) pdarray, Strings, or Categorical) – The array to group by value, or if list, the column arrays to group by row
assume_sorted (bool) – If True, assume keys is already sorted (Default: False)
- nkeys#
The number of key arrays (columns)
- Type:
int
- size#
The length of the input array(s), i.e. number of rows
- Type:
int
- unique_keys#
The unique values of the keys array(s), in grouped order
- Type:
(list of) pdarray, Strings, or Categorical
- ngroups#
The length of the unique_keys array(s), i.e. number of groups
- Type:
int
- logger#
Used for all logging operations
- Type:
ArkoudaLogger
- Raises:
TypeError – Raised if keys is a pdarray with a dtype other than int64
Notes
Integral pdarrays, Strings, and Categoricals are natively supported, but float64 and bool arrays are not.
For a user-defined class to be groupable, it must inherit from pdarray and define or overload the grouping API:
a ._get_grouping_keys() method that returns a list of pdarrays that can be (co)argsorted.
(Optional) a .group() method that returns the permutation that groups the array
If the input is a single array with a .group() method defined, method 2 will be used; otherwise, method 1 will be used.
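The two-method protocol above can be sketched structurally. This is a hedged illustration only: a real groupable class must inherit from arkouda's pdarray and return server-side pdarrays, whereas plain lists stand in for them here so the shape of the API is visible.

```python
class MyGroupable:
    """Structural sketch of the grouping API described above (illustrative)."""

    def __init__(self, codes):
        self.codes = codes  # stand-in for a server-side pdarray

    def _get_grouping_keys(self):
        # Method 1: return a list of (co)argsort-able key arrays
        return [self.codes]

    def group(self):
        # Method 2 (optional): permutation that groups equal values together
        return sorted(range(len(self.codes)), key=lambda i: self.codes[i])

g = MyGroupable([2, 0, 2, 1])
print(g._get_grouping_keys())  # [[2, 0, 2, 1]]
print(g.group())               # [1, 3, 0, 2]
```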
- Reductions#
- objType = 'GroupBy'#
- static from_return_msg(rep_msg)#
- to_hdf(prefix_path, dataset='groupby', mode='truncate', file_type='distribute')#
Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.
file_type (str (“single” | “distribute”)) – Default: “distribute”. When set to “single”, the dataset is written to a single file. When “distribute”, the dataset is written to one file per locale. This is only supported by HDF5 files and has no impact on Parquet files.
- Returns:
None
GroupBy is not currently supported by Parquet
- update_hdf(prefix_path: str, dataset: str = 'groupby', repack: bool = True)#
- size() Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Count the number of elements in each group, i.e. the number of times each key appears.
- Parameters:
none –
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
counts (pdarray, int64) – The number of times each unique key appears
See also
Notes
This alias for “count” was added to conform with the Pandas API.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys, counts = g.size()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
- count() Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Count the number of elements in each group, i.e. the number of times each key appears.
- Parameters:
none –
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
counts (pdarray, int64) – The number of times each unique key appears
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys, counts = g.count()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
- aggregate(values: groupable, operator: str, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, groupable]#
Using the permutation stored in the GroupBy instance, group another array of values and apply a reduction to each group’s values.
- Parameters:
values (pdarray) – The values to group and reduce
operator (str) – The name of the reduction operator to use
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
unique_keys (groupable) – The unique keys, in grouped order
aggregates (groupable) – One aggregate value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if the requested operator is not supported for the values dtype
Examples
>>> keys = ak.arange(0, 10)
>>> vals = ak.linspace(-1, 1, 10)
>>> g = ak.GroupBy(keys)
>>> g.aggregate(vals, 'sum')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777768, -0.55555555555555536, -0.33333333333333348, -0.11111111111111116, 0.11111111111111116, 0.33333333333333348, 0.55555555555555536, 0.77777777777777768, 1]))
>>> g.aggregate(vals, 'min')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777779, -0.55555555555555558, -0.33333333333333337, -0.11111111111111116, 0.11111111111111116, 0.33333333333333326, 0.55555555555555536, 0.77777777777777768, 1]))
- sum(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and sum each group’s values.
- Parameters:
values (pdarray) – The values to group and sum
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_sums (pdarray) – One sum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The grouped sum of a boolean pdarray returns integers.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.sum(b)
(array([2, 3, 4]), array([8, 14, 6]))
- prod(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the product of each group’s values.
- Parameters:
values (pdarray) – The values to group and multiply
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_products (pdarray, float64) – One product per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if prod is not supported for the values dtype
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.prod(b)
(array([2, 3, 4]), array([12, 108.00000000000003, 8.9999999999999982]))
- var(values: arkouda.pdarrayclass.pdarray, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the variance of each group’s values.
- Parameters:
values (pdarray) – The values to group and find variance
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The variance is the average of the squared deviations from the mean, i.e.,
var = mean((x - x.mean())**2).
The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.var(b)
(array([2 3 4]), array([2.333333333333333 1.2 0]))
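The ddof behavior described in the Notes can be checked with a small pure-Python computation (this illustrates the formula, not arkouda's grouped implementation):

```python
def var_with_ddof(x, ddof=1):
    """Variance using the N - ddof divisor described above."""
    n = len(x)
    mean = sum(x) / n
    return sum((v - mean) ** 2 for v in x) / (n - ddof)

x = [3.0, 3.0, 3.0, 4.0]
print(var_with_ddof(x, ddof=0))  # 0.1875 (maximum likelihood estimate)
print(var_with_ddof(x, ddof=1))  # 0.25   (unbiased estimate)
```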
- std(values: arkouda.pdarrayclass.pdarray, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the standard deviation of each group’s values.
- Parameters:
values (pdarray) – The values to group and find standard deviation
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The standard deviation is the square root of the average of the squared deviations from the mean, i.e.,
std = sqrt(mean((x - x.mean())**2)).
The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.std(b)
(array([2 3 4]), array([1.5275252316519465 1.0954451150103321 0]))
- mean(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the mean of each group’s values.
- Parameters:
values (pdarray) – The values to group and average
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_means (pdarray, float64) – One mean value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.mean(b)
(array([2, 3, 4]), array([2.6666666666666665, 2.7999999999999998, 3]))
- median(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the median of each group’s values.
- Parameters:
values (pdarray) – The values to group and find median
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_medians (pdarray, float64) – One median value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,9)
>>> a
array([4 1 4 3 2 2 2 3 3])
>>> g = ak.GroupBy(a)
>>> g.keys
array([4 1 4 3 2 2 2 3 3])
>>> b = ak.linspace(-5,5,9)
>>> b
array([-5 -3.75 -2.5 -1.25 0 1.25 2.5 3.75 5])
>>> g.median(b)
(array([1 2 3 4]), array([-3.75 1.25 3.75 -3.75]))
- min(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the minimum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find minima
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_minima (pdarray) – One minimum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if min is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if min is not supported for the values dtype
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.min(b)
(array([2, 3, 4]), array([1, 1, 3]))
- max(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the maximum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find maxima
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_maxima (pdarray) – One maximum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if max is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if max is not supported for the values dtype
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.max(b)
(array([2, 3, 4]), array([4, 4, 3]))
- argmin(values: arkouda.pdarrayclass.pdarray) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first minimum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find argmin
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if argmin is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if argmin is not supported for the values dtype
Notes
The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmin(b)
(array([2, 3, 4]), array([5, 4, 2]))
- argmax(values: arkouda.pdarrayclass.pdarray) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first maximum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find argmax
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmax(b)
(array([2, 3, 4]), array([9, 3, 2]))
- nunique(values: groupable) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the number of unique values in each group.
- Parameters:
values (pdarray, int64) – The values to group and find unique values
- Returns:
unique_keys (groupable) – The unique keys, in grouped order
group_nunique (groupable) – Number of unique values per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the dtype(s) of values array(s) does/do not support the nunique method
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if nunique is not supported for the values dtype
Examples
>>> data = ak.array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> data
array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> labels = ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> labels
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g = ak.GroupBy(labels)
>>> g.keys
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g.nunique(data)
(array([1, 2, 3, 4]), array([2, 2, 3, 1]))
>>> # Group 1 has values [3, 4, 3] -> 2 unique values (3 and 4)
>>> # Group 2 has values [1, 1, 4] -> 2 unique values (1 and 4)
>>> # Group 3 has values [3, 4, 1] -> 3 unique values
>>> # Group 4 has values [4] -> 1 unique value
- any(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and perform an “or” reduction on each group.
- Parameters:
values (pdarray, bool) – The values to group and reduce with “or”
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_any (pdarray, bool) – One bool per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
- all(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and perform an “and” reduction on each group.
- Parameters:
values (pdarray, bool) – The values to group and reduce with “and”
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_any (pdarray, bool) – One bool per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if all is not supported for the values dtype
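The “or” and “and” reductions can be sketched with a plain-Python grouping analogue (illustrative only; arkouda performs these reductions server-side via the stored permutation):

```python
def group_reduce(keys, values, op):
    """Apply a reduction op to the values of each key, in sorted key order."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    uniq = sorted(groups)
    return uniq, [op(groups[k]) for k in uniq]

keys = [1, 2, 1, 2, 3]
flags = [True, False, False, False, True]
print(group_reduce(keys, flags, any))  # ([1, 2, 3], [True, False, True])
print(group_reduce(keys, flags, all))  # ([1, 2, 3], [False, False, True])
```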
- OR(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise OR of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise OR reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with OR
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise OR of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if OR is not supported for the values dtype
- AND(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise AND of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise AND reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with AND
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise AND of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if AND is not supported for the values dtype
- XOR(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise XOR of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise XOR reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with XOR
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise XOR of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if XOR is not supported for the values dtype
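The three bitwise segment reductions (OR, AND, XOR) can likewise be sketched with a plain-Python analogue (illustrative only, not the server-side implementation):

```python
from functools import reduce
import operator

def group_bitwise(keys, values, op):
    """Bitwise-reduce the values of each key, in sorted key order."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    uniq = sorted(groups)
    return uniq, [reduce(op, groups[k]) for k in uniq]

keys = [0, 0, 1, 1]
vals = [0b0011, 0b0101, 0b0011, 0b0101]
print(group_bitwise(keys, vals, operator.or_))   # ([0, 1], [7, 7])
print(group_bitwise(keys, vals, operator.and_))  # ([0, 1], [1, 1])
print(group_bitwise(keys, vals, operator.xor))   # ([0, 1], [6, 6])
```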
- first(values: groupable_element_type) Tuple[groupable, groupable_element_type]#
First value in each group.
- Parameters:
values (pdarray-like) – The values from which to take the first of each group
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The first value of each group
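A plain-Python reading of these semantics (illustrative; “first” is taken here as the first occurrence in the original order, with keys returned sorted as they are for integer groupables):

```python
def group_first(keys, values):
    """First value seen for each key, in sorted key order."""
    firsts = {}
    for k, v in zip(keys, values):
        firsts.setdefault(k, v)  # keeps only the first occurrence per key
    uniq = sorted(firsts)
    return uniq, [firsts[k] for k in uniq]

print(group_first([2, 1, 2, 1], ['a', 'b', 'c', 'd']))  # ([1, 2], ['b', 'a'])
```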
- mode(values: groupable) Tuple[groupable, groupable]#
Most common value in each group. If a group is multi-modal, return the modal value that occurs first.
- Parameters:
values ((list of) pdarray-like) – The values from which to take the mode of each group
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result ((list of) pdarray-like) – The most common value of each group
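The multi-modal tie-break described above (“the modal value that occurs first”) can be sketched in plain Python (illustrative only):

```python
def group_mode(keys, values):
    """Most common value per key; ties resolved by earliest occurrence."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    uniq = sorted(groups)
    modes = []
    for k in uniq:
        vals = groups[k]
        # Highest count wins; an earlier first occurrence breaks ties
        best = max(set(vals), key=lambda v: (vals.count(v), -vals.index(v)))
        modes.append(best)
    return uniq, modes

# Group 2 is bimodal (9 and 3 each occur once); 9 occurs first, so it wins
print(group_mode([1, 1, 1, 2, 2], [5, 7, 5, 9, 3]))  # ([1, 2], [5, 9])
```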
- unique(values: groupable)#
Return the set of unique values in each group, as a SegArray.
- Parameters:
values ((list of) pdarray-like) – The values to unique
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result ((list of) SegArray) – The unique values of each group
- Raises:
TypeError – Raised if values is or contains Strings or Categorical
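The per-group unique-set semantics can be sketched with a plain-Python analogue (illustrative; arkouda returns the per-group results as a SegArray):

```python
def group_unique(keys, values):
    """Ordered-unique values of each group, in sorted key order."""
    groups = {}
    for k, v in zip(keys, values):
        bucket = groups.setdefault(k, [])
        if v not in bucket:
            bucket.append(v)  # keep first occurrence only
    uniq = sorted(groups)
    return uniq, [groups[k] for k in uniq]

print(group_unique([1, 1, 2, 2, 2], [3, 3, 4, 5, 4]))  # ([1, 2], [[3], [4, 5]])
```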
- broadcast(values: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings, permute: bool = True) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings#
Fill each group’s segment with a constant value.
- Parameters:
- Returns:
The broadcasted values
- Return type:
- Raises:
TypeError – Raised if value is not a pdarray object
ValueError – Raised if the values array does not have one value per segment
Notes
This function is a sparse analog of
np.broadcast. If a GroupBy object represents a sparse matrix (tensor), then this function takes a (dense) column vector and replicates each value to the non-zero elements in the corresponding row.Examples
>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
>>> # By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
>>> # With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5])
>>> a = ak.randint(1,5,10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys, counts = g.count()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
- static build_from_components(user_defined_name: str = None, **kwargs) GroupBy#
Function to build a new GroupBy object from component keys and permutation.
- Parameters:
user_defined_name (str, optional) – Passing a name will initialize the new GroupBy and assign it the given name
kwargs (dict) – Dictionary of components required for rebuilding the GroupBy. Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”
- Returns:
The GroupBy object created by using the given components
- Return type:
- register(user_defined_name: str) GroupBy#
Register this GroupBy object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the GroupBy is to be registered under, this will be the root name for underlying components
- Returns:
The same GroupBy which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different GroupBys with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the GroupBy with the user_defined_name
See also
unregister,attach,unregister_groupby_by_name,is_registeredNotes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister()#
Unregister this GroupBy object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() bool#
Return True if the object is contained in the registry
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mismatch of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- static attach(user_defined_name: str) GroupBy#
Function to return a GroupBy object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which GroupBy object was registered under
- Returns:
The GroupBy object created by re-attaching to the corresponding server components
- Return type:
- Raises:
RegistrationError – if user_defined_name is not registered
See also
register,is_registered,unregister,unregister_groupby_by_name
- static unregister_groupby_by_name(user_defined_name: str) None#
Function to unregister GroupBy object by name which was registered with the arkouda server via register()
- Parameters:
user_defined_name (str) – Name under which the GroupBy object was registered
- Raises:
TypeError – if user_defined_name is not a string
RegistrationError – if there is an issue attempting to unregister any underlying components
See also
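The register/attach/unregister lifecycle above can be pictured with a dictionary-based registry analogue. This is purely conceptual; the real registry lives on the arkouda server, and registered objects there are immune to deletion until unregistered.

```python
class Registry:
    """Toy name-to-object registry mirroring the lifecycle described above."""

    def __init__(self):
        self._entries = {}

    def register(self, name, obj):
        if name in self._entries:
            raise ValueError(f"{name!r} is already registered")
        self._entries[name] = obj  # registered objects are retained by name
        return obj

    def attach(self, name):
        if name not in self._entries:
            raise KeyError(f"{name!r} is not registered")
        return self._entries[name]

    def unregister(self, name):
        del self._entries[name]  # object becomes eligible for deletion again

reg = Registry()
reg.register("my_groupby", {"keys": [1, 2, 3]})
print(reg.attach("my_groupby"))  # {'keys': [1, 2, 3]}
reg.unregister("my_groupby")
```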
- most_common(values)#
(Deprecated) See GroupBy.mode().
- arkouda.broadcast(segments: arkouda.pdarrayclass.pdarray, values: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings, size: int | numpy.int64 | numpy.uint64 = -1, permutation: arkouda.pdarrayclass.pdarray | None = None)#
Broadcast a dense column vector to the rows of a sparse matrix or grouped array.
- Parameters:
segments (pdarray, int64) – Offsets of the start of each row in the sparse matrix or grouped array. Must be sorted in ascending order.
values (pdarray, Strings) – The values to broadcast, one per row (or group)
size (int) – The total number of nonzeros in the matrix. If permutation is given, this argument is ignored and the size is inferred from the permutation array.
permutation (pdarray, int64) – The permutation to go from the original ordering of nonzeros to the ordering grouped by row. To broadcast values back to the original ordering, this permutation will be inverted. If no permutation is supplied, it is assumed that the original nonzeros were already grouped by row. In this case, the size argument must be given.
- Returns:
The broadcast values, one per nonzero
- Return type:
- Raises:
ValueError –
If segments and values are different sizes
If segments are empty
If number of nonzeros (either user-specified or inferred from permutation) is less than one
Examples
>>> # Define a sparse matrix with 3 rows and 7 nonzeros
>>> row_starts = ak.array([0, 2, 5])
>>> nnz = 7
>>> # Broadcast the row number to each nonzero element
>>> row_number = ak.arange(3)
>>> ak.broadcast(row_starts, row_number, nnz)
array([0 0 1 1 1 2 2])
>>> # If the original nonzeros were in reverse order...
>>> permutation = ak.arange(6, -1, -1)
>>> ak.broadcast(row_starts, row_number, permutation=permutation)
array([2 2 1 1 1 0 0])
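The core semantics of broadcast without a permutation can be sketched client-side with NumPy: each row's value is repeated once per nonzero in that row. Here broadcast_by_segments is a hypothetical illustrative helper, not part of the arkouda API.

```python
import numpy as np

def broadcast_by_segments(segments, values, size):
    # Segment lengths are the gaps between consecutive offsets;
    # the final segment runs to `size`, the total number of nonzeros.
    lengths = np.diff(np.append(segments, size))
    # Repeat each per-row value by its row's nonzero count.
    return np.repeat(values, lengths)

row_starts = np.array([0, 2, 5])
row_number = np.arange(3)
print(broadcast_by_segments(row_starts, row_number, 7))  # → [0 0 1 1 1 2 2]
```

The actual server-side implementation operates on distributed arrays; this sketch only mirrors the result for a single-locale example.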
- arkouda.GROUPBY_REDUCTION_TYPES#
- class arkouda.Strings(strings_pdarray: arkouda.pdarrayclass.pdarray, bytes_size: arkouda.dtypes.int_scalars)#
Represents an array of strings whose data resides on the arkouda server. The user should not call this class directly; rather its instances are created by other arkouda functions.
- entry#
Encapsulation of a Segmented Strings array contained on the arkouda server. This is a composite of
offsets array: starting indices for each string
bytes array: raw bytes of all strings joined by nulls
- Type:
- size#
The number of strings in the array
- Type:
int_scalars
- nbytes#
The total number of bytes in all strings
- Type:
int_scalars
- ndim#
The rank of the array (currently only rank 1 arrays supported)
- Type:
int_scalars
- shape#
The sizes of each dimension of the array
- Type:
tuple
- dtype#
The dtype is ak.str
- Type:
dtype
- logger#
Used for all logging operations
- Type:
ArkoudaLogger
Notes
Strings is composed of two pdarrays: (1) offsets, which contains the starting indices for each string and (2) bytes, which contains the raw bytes of all strings, delimited by nulls.
- BinOps#
- objType = 'Strings'#
- static from_return_msg(rep_msg: str) Strings#
Factory method for creating a Strings object from an Arkouda server response message
- Parameters:
rep_msg (str) – Server response message currently of form created name type size ndim shape itemsize+created bytes.size 1234
- Returns:
object representing a segmented strings array on the server
- Return type:
- Raises:
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
We really don’t have an itemsize because these are variable length strings. In the future we could probably use this position to store the total bytes.
- static from_parts(offset_attrib: arkouda.pdarrayclass.pdarray | str, bytes_attrib: arkouda.pdarrayclass.pdarray | str) Strings#
Factory method for creating a Strings object from an Arkouda server response where the arrays are separate components.
- Parameters:
- Returns:
object representing a segmented strings array on the server
- Return type:
- Raises:
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
This factory method is used when we construct the parts of a Strings object on the client side and transfer the offsets & bytes separately to the server. This results in two entries in the symbol table, and we need to instruct the server to assemble them into a composite entity.
- get_lengths() arkouda.pdarrayclass.pdarray#
Return the length of each string in the array.
- Returns:
The length of each string
- Return type:
pdarray, int
- Raises:
RuntimeError – Raised if there is a server-side error thrown
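What get_lengths computes can be sketched from the offsets/bytes layout described in the class Notes; this is a client-side NumPy illustration of the semantics, not how the server derives it.

```python
import numpy as np

# Offsets and null-delimited bytes for ['one', 'two', 'three'],
# mirroring the two-pdarray layout described in the class Notes.
offsets = np.array([0, 4, 8])
raw = b'one\x00two\x00three\x00'

# Each string's length is the gap between consecutive offsets,
# minus one for the trailing null terminator.
lengths = np.diff(np.append(offsets, len(raw))) - 1
print(lengths)  # → [3 3 5]
```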
- get_bytes()#
Getter for the bytes component (uint8 pdarray) of this Strings.
- Returns:
Pdarray of bytes of the string accessed
- Return type:
pdarray, uint8
Example
>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_bytes()
[111 110 101 0 116 119 111 0 116 104 114 101 101 0]
- get_offsets()#
Getter for the offsets component (int64 pdarray) of this Strings.
- Returns:
Pdarray of offsets of the string accessed
- Return type:
pdarray, int64
Example
>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_offsets()
[0 4 8]
- encode(toEncoding: str, fromEncoding: str = 'UTF-8')#
Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding
- Parameters:
toEncoding (str) – The encoding that the strings will be converted to
fromEncoding (str) – The current encoding of the strings object, default to UTF-8
- Returns:
A new Strings object in toEncoding
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- decode(fromEncoding, toEncoding='UTF-8')#
Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding
- Parameters:
fromEncoding (str) – The current encoding of the strings object
toEncoding (str) – The encoding that the strings will be converted to, default to UTF-8
- Returns:
A new Strings object in toEncoding
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
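The element-wise semantics of encode() can be sketched with Python's built-in codecs. Here reencode is a hypothetical client-side helper; arkouda performs the conversion server-side in bulk.

```python
# Element-wise sketch of encode()'s semantics using Python's
# built-in codecs (reencode is a hypothetical illustrative helper).
def reencode(vals, to_encoding):
    # Encode each element into the target encoding's byte form.
    return [s.encode(to_encoding) for s in vals]

print(reencode(['münchen'], 'idna'))  # → [b'xn--mnchen-3ya']
```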
- to_lower() Strings#
Returns a new Strings with all uppercase characters from the original replaced with their lowercase equivalent
- Returns:
Strings with all uppercase characters from the original replaced with their lowercase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.to_lower()
array(['strings 0', 'strings 1', 'strings 2', 'strings 3', 'strings 4'])
- to_upper() Strings#
Returns a new Strings with all lowercase characters from the original replaced with their uppercase equivalent
- Returns:
Strings with all lowercase characters from the original replaced with their uppercase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.to_upper()
array(['STRINGS 0', 'STRINGS 1', 'STRINGS 2', 'STRINGS 3', 'STRINGS 4'])
- to_title() Strings#
Returns a new Strings with the strings of the original converted to their titlecase equivalent
- Returns:
Strings with the original strings converted to their titlecase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Strings.to_lower, Strings.to_upper
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.to_title()
array(['Strings 0', 'Strings 1', 'Strings 2', 'Strings 3', 'Strings 4'])
- is_lower() arkouda.pdarrayclass.pdarray#
Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely lowercase
- Returns:
True for elements that are entirely lowercase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.is_lower()
array([True True True False False False])
- is_upper() arkouda.pdarrayclass.pdarray#
Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely uppercase
- Returns:
True for elements that are entirely uppercase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.is_upper()
array([False False False True True True])
- is_title() arkouda.pdarrayclass.pdarray#
Returns a boolean pdarray where index i indicates whether string i of the Strings is titlecase
- Returns:
True for elements that are titlecase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> mixed = ak.array([f'sTrINgs {i}' for i in range(3)])
>>> title = ak.array([f'Strings {i}' for i in range(3)])
>>> strings = ak.concatenate([mixed, title])
>>> strings
array(['sTrINgs 0', 'sTrINgs 1', 'sTrINgs 2', 'Strings 0', 'Strings 1', 'Strings 2'])
>>> strings.is_title()
array([False False False True True True])
- strip(chars: bytes | arkouda.dtypes.str_scalars | None = '') Strings#
Returns a new Strings object with all leading and trailing occurrences of characters contained in chars removed. The chars argument is a string specifying the set of characters to be removed. If omitted, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.
- Parameters:
chars – the set of characters to be removed
- Returns:
Strings object with the leading and trailing characters matching the set of characters in the chars argument removed
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['Strings ', ' StringS ', 'StringS '])
>>> s = strings.strip()
>>> s
array(['Strings', 'StringS', 'StringS'])
>>> strings = ak.array(['Strings 1', '1 StringS ', ' 1StringS 12 '])
>>> s = strings.strip(' 12')
>>> s
array(['Strings', 'StringS', 'StringS'])
- cached_regex_patterns() List#
Returns the regex patterns for which Match objects have been cached
- purge_cached_regex_patterns() None#
Purges the cached regex patterns
- find_locations(pattern: bytes | arkouda.dtypes.str_scalars) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Finds pattern matches and returns pdarrays containing the number, start positions, and lengths of matches
- Parameters:
pattern (str_scalars) – The regex pattern used to find matches
- Returns:
pdarray, int64 – For each original string, the number of pattern matches
pdarray, int64 – The start positions of pattern matches
pdarray, int64 – The lengths of pattern matches
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> num_matches, starts, lens = strings.find_locations('\d')
>>> num_matches
array([2, 2, 2, 2, 2])
>>> starts
array([0, 9, 0, 9, 0, 9, 0, 9, 0, 9])
>>> lens
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
- search(pattern: bytes | arkouda.dtypes.str_scalars) arkouda.match.Match#
Returns a match object with the first location in each element where pattern produces a match. Elements match if any part of the string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match if any part of the string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+')
<ak.Match object: matched=True, span=(1, 2); matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
- match(pattern: bytes | arkouda.dtypes.str_scalars) arkouda.match.Match#
Returns a match object where elements match only if the beginning of the string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match only if the beginning of the string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.match('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
- fullmatch(pattern: bytes | arkouda.dtypes.str_scalars) arkouda.match.Match#
Returns a match object where elements match only if the whole string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match only if the whole string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.fullmatch('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=False; matched=False>
- split(pattern: bytes | arkouda.dtypes.str_scalars, maxsplit: int = 0, return_segments: bool = False) Strings | Tuple#
Returns a new Strings split by the occurrences of pattern. If maxsplit is nonzero, at most maxsplit splits occur
- Parameters:
pattern (str) – Regex used to split strings into substrings
maxsplit (int) – The max number of pattern match occurrences in each element to split. The default maxsplit=0 splits on all occurrences
return_segments (bool) – If True, return mapping of original strings to first substring in return array.
- Returns:
Strings – Substrings with pattern matches removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.split('_+', maxsplit=2, return_segments=True)
(array(['1', '2', '', '', '', '3', '', '4', '5____6___7', '']), array([0 3 5 6 9]))
- findall(pattern: bytes | arkouda.dtypes.str_scalars, return_match_origins: bool = False) Strings | Tuple#
Return a new Strings containing all non-overlapping matches of pattern
- Parameters:
pattern (str_scalars) – Regex used to find matches
return_match_origins (bool) – If True, return a pdarray containing the index of the original string each pattern match is from
- Returns:
Strings – Strings object containing only pattern matches
pdarray, int64 (optional) – The index of the original string each pattern match is from
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.findall('_+', return_match_origins=True)
(array(['_', '___', '____', '__', '___', '____', '___']), array([0 0 1 3 3 3 3]))
- sub(pattern: bytes | arkouda.dtypes.str_scalars, repl: bytes | arkouda.dtypes.str_scalars, count: int = 0) Strings#
Return new Strings obtained by replacing non-overlapping occurrences of pattern with the replacement repl. If count is nonzero, at most count substitutions occur
- Parameters:
pattern (str_scalars) – The regex to substitute
repl (str_scalars) – The substring to replace pattern matches with
count (int) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Returns:
Strings with pattern matches replaced
- Return type:
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.sub(pattern='_+', repl='-', count=2)
array(['1-2-', '-', '3', '-4-5____6___7', ''])
- subn(pattern: bytes | arkouda.dtypes.str_scalars, repl: bytes | arkouda.dtypes.str_scalars, count: int = 0) Tuple#
Perform the same operation as sub(), but return a tuple (new_Strings, number_of_substitutions)
- Parameters:
pattern (str_scalars) – The regex to substitute
repl (str_scalars) – The substring to replace pattern matches with
count (int) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Returns:
Strings – Strings with pattern matches replaced
pdarray, int64 – The number of substitutions made for each element of Strings
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.subn(pattern='_+', repl='-', count=2)
(array(['1-2-', '-', '3', '-4-5____6___7', '']), array([2 1 0 2 0]))
- contains(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element contains the given substring.
- Parameters:
substr (str_scalars) – The substring in the form of string or byte array to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that contain substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> strings
array(['1 string 1', '2 string 2', '3 string 3', '4 string 4', '5 string 5'])
>>> strings.contains('string')
array([True, True, True, True, True])
>>> strings.contains('string \d', regex=True)
array([True, True, True, True, True])
- startswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element starts with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The prefix to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that start with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.startswith('string')
array([True, True, True, True, True])
>>> strings_start = ak.array([f'{i} string' for i in range(1, 6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.startswith('\d str', regex=True)
array([True, True, True, True, True])
- endswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element ends with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The suffix to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that end with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings_start = ak.array([f'{i} string' for i in range(1, 6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.endswith('ing')
array([True, True, True, True, True])
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.endswith('ing \d', regex=True)
array([True, True, True, True, True])
- flatten(delimiter: str, return_segments: bool = False, regex: bool = False) Strings | Tuple#
Unpack delimiter-joined substrings into a flat array.
- Parameters:
delimiter (str) – Characters used to split strings into substrings
return_segments (bool) – If True, also return mapping of original strings to first substring in return array.
regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
Strings – Flattened substrings with delimiters removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array
Examples
>>> orig = ak.array(['one|two', 'three|four|five', 'six'])
>>> orig.flatten('|')
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> flat, map = orig.flatten('|', return_segments=True)
>>> map
array([0, 2, 5])
>>> under = ak.array(['one_two', 'three_____four____five', 'six'])
>>> under_flat, under_map = under.flatten('_+', return_segments=True, regex=True)
>>> under_flat
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> under_map
array([0, 2, 5])
- peel(delimiter: bytes | arkouda.dtypes.str_scalars, times: arkouda.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, fromRight: bool = False, regex: bool = False) Tuple#
Peel off one or more delimited fields from each string (similar to string.partition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the first (times-1) delimiters
includeDelimiter (bool) – If true, append the delimiter to the end of the first return array. By default, it is prepended to the beginning of the second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the first array. By default, such strings are returned in the second array.
fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)
regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
- left: Strings
The field(s) peeled from the end of each string (unless fromRight is true)
- right: Strings
The remainder of each string after peeling (unless fromRight is true)
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not byte or str_scalars, if times is not int64, or if includeDelimiter, keepPartial, or fromRight is not bool
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
>>> s.peel('.', includeDelimiter=True)
(array(['a.', 'c.', 'e.']), array(['b', 'd', 'f.g']))
>>> s.peel('.', times=2)
(array(['', '', 'e.f']), array(['a.b', 'c.d', 'g']))
>>> s.peel('.', times=2, keepPartial=True)
(array(['a.b', 'c.d', 'e.f']), array(['', '', 'g']))
- rpeel(delimiter: bytes | arkouda.dtypes.str_scalars, times: arkouda.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, regex: bool = False)#
Peel off one or more delimited fields from the end of each string (similar to string.rpartition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the last (times-1) delimiters
includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return array. By default, it is appended to the end of the second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the second array. By default, such strings are returned in the first array.
regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
- left: Strings
The remainder of the string after peeling
- right: Strings
The field(s) that were peeled from the right of each string
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if times is not int64
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.rpeel('.')
(array(['a', 'c', 'e.f']), array(['b', 'd', 'g']))
>>> # Compared against peel
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
- stick(other: Strings, delimiter: bytes | arkouda.dtypes.str_scalars = '', toLeft: bool = False) Strings#
Join the strings from another array onto one end of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (str) – String inserted between self and other
toLeft (bool) – If true, join other strings to the left of self. By default, other is joined to the right of self.
- Returns:
The array of joined strings
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if the other parameter is not a Strings instance
ValueError – Raised if times is < 1
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.stick(t, delimiter='.')
array(['a.b', 'c.d', 'e.f'])
- lstick(other: Strings, delimiter: bytes | arkouda.dtypes.str_scalars = '') Strings#
Join the strings from another array onto the left of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (Union[bytes,str_scalars]) – String inserted between self and other
- Returns:
The array of joined strings, as other + self
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is neither bytes nor a str or if the other parameter is not a Strings instance
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.lstick(t, delimiter='.')
array(['b.a', 'd.c', 'f.e'])
- get_prefixes(n: arkouda.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray]#
Return the n-long prefix of each string, where possible
- Parameters:
n (int) – Length of prefix
return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-prefix
proper (bool) – If True, only return proper prefixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a prefix.
- Returns:
prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character prefix, False otherwise.
- get_suffixes(n: arkouda.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray]#
Return the n-long suffix of each string, where possible
- Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-suffix
proper (bool) – If True, only return proper suffixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a suffix.
- Returns:
suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character suffix, False otherwise.
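The selection rule shared by get_prefixes and get_suffixes can be sketched in plain Python; get_suffixes here is a hypothetical client-side illustration of the proper/mask semantics described above, not the arkouda implementation.

```python
import numpy as np

# A string yields an n-suffix only if it is long enough (at least
# n+1 characters when proper=True); the boolean mask marks which
# strings qualified, matching the origin_indices return value.
def get_suffixes(strings, n, proper=True):
    min_len = n + 1 if proper else n
    mask = np.array([len(s) >= min_len for s in strings])
    suffixes = [s[-n:] for s, ok in zip(strings, mask) if ok]
    return suffixes, mask

sufs, origins = get_suffixes(['ab', 'abcd', 'xyz'], 2)
print(sufs)           # → ['cd', 'yz']
print(list(origins))  # → [False, True, True]
```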
- hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Compute a 128-bit hash of each string.
- Returns:
A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.
- Return type:
Notes
The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.
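The shape of the result, two uint64 halves per string that concatenate to one 128-bit value, can be sketched client-side. This illustration uses MD5 purely as a stand-in 128-bit digest; the server uses SipHash128 as noted above.

```python
import hashlib
import numpy as np

# Client-side sketch of the return shape: two uint64 arrays whose
# ith values together form a 128-bit hash of the ith string.
# MD5 is a stand-in here; the server uses SipHash128.
def hash_strings(strings):
    hi, lo = [], []
    for s in strings:
        d = hashlib.md5(s.encode()).digest()     # 16 bytes = 128 bits
        hi.append(int.from_bytes(d[:8], 'big'))  # upper 64 bits
        lo.append(int.from_bytes(d[8:], 'big'))  # lower 64 bits
    return np.array(hi, dtype=np.uint64), np.array(lo, dtype=np.uint64)

hi, lo = hash_strings(['hello', 'world', 'hello'])
print(hi[0] == hi[2], lo[0] == lo[2])  # equal strings hash equally
```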
- group() arkouda.pdarrayclass.pdarray#
Return the permutation that groups the array, placing equivalent strings together. All instances of the same string are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.
- Returns:
The permutation that groups the array by value
- Return type:
Notes
If the arkouda server is compiled with “-sSegmentedString.useHash=true”, then arkouda uses 128-bit hash values to group strings, rather than sorting the strings directly. This method is fast, but the resulting permutation merely groups equivalent strings and does not sort them. If the “useHash” parameter is false, then a full sort is performed.
- Raises:
RuntimeError – Raised if there is a server-side error in executing group request or creating the pdarray encapsulating the return message
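The guarantee that group() provides (equivalent strings contiguous, blocks not necessarily ordered) can be illustrated with a stable argsort in NumPy. Note that this client-side sketch fully sorts, which is stronger than what group() promises when hashing is enabled.

```python
import numpy as np

# Client-side sketch: a stable argsort yields a permutation that
# places all copies of each string in one contiguous block.
strings = np.array(['b', 'a', 'b', 'c', 'a'])
perm = np.argsort(strings, kind='stable')
print(strings[perm])  # → ['a' 'a' 'b' 'b' 'c']
```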
- to_ndarray() numpy.ndarray#
Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same strings as this array
- Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
numpy.ndarray
- to_list() list#
Convert the SegString to a list, transferring data from the arkouda server to Python. If the SegString exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A list with the same strings as this SegString
- Return type:
list
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_list()
['hello', 'my', 'world']
>>> type(a.to_list())
list
- astype(dtype) arkouda.pdarrayclass.pdarray#
Cast values of Strings object to provided dtype
- Parameters:
dtype (np.dtype or str) – Dtype to cast to
- Returns:
An arkouda pdarray with values converted to the specified data type
- Return type:
ak.pdarray
Notes
This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.
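The cast semantics can be sketched with NumPy on the client; this mirrors for a small example what ak.cast performs server-side on a Strings object.

```python
import numpy as np

# Client-side sketch of Strings.astype: parse each string element
# into the requested numeric dtype.
vals = np.array(['1', '2', '3'])
as_int = vals.astype(np.int64)
print(as_int)        # → [1 2 3]
print(as_int.dtype)  # → int64
```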
- to_parquet(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', compression: str | None = None) str#
Save the Strings object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If 'append', attempt to create a new dataset in existing files.
compression (str (Optional)) – (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4") Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'.
'append' write mode is supported, but is not efficient.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
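As a sketch of the naming scheme described in the Notes, a hypothetical helper (locale_filenames is not part of arkouda) that lists the per-locale file names a distributed write would produce, assuming <i> runs from 0 through numLocales-1:

```python
# Hypothetical helper (not an arkouda API): enumerate the per-locale
# output files that a distributed write with this prefix_path produces.
def locale_filenames(prefix_path: str, num_locales: int) -> list:
    return [f"{prefix_path}_LOCALE{i}" for i in range(num_locales)]

print(locale_filenames("/data/strings", 4))
# ['/data/strings_LOCALE0', '/data/strings_LOCALE1',
#  '/data/strings_LOCALE2', '/data/strings_LOCALE3']
```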
- to_hdf(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, file_type: str = 'distribute') str#
Save the Strings object to HDF5. The object can be saved to a collection of files or a single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – The name of the Strings dataset to be written, defaults to strings_array
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read.
file_type (str ("single" | "distribute")) – Default: "distribute". When "distribute", the dataset is written to one file per locale. When "single", the dataset is saved to one file.
- Return type:
String message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
Parquet files do not store the segments, only the values.
Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string.
The hdf5 group is named via the dataset parameter.
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form
<prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'. Otherwise, the file name will be prefix_path.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
See also
- update_hdf(prefix_path: str, dataset: str = 'strings_array', save_offsets: bool = True, repack: bool = True)#
Overwrite the dataset with the name provided with this Strings object. If the dataset does not exist, it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read.
repack (bool) – Default: True. HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to False will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the Strings object
Notes
If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
- to_csv(prefix_path: str, dataset: str = 'strings_array', col_delim: str = ',', overwrite: bool = False)#
Write Strings to CSV file(s). File will contain a single column with the Strings data. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).
- Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
dataset (str) – Column name to save the Strings under. Defaults to “strings_array”.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
str response message
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (
\n) at this time.
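Since the docs warn that delimiters found within values will corrupt the CSV layout, a hypothetical pre-flight check (delimiter_is_safe is not an arkouda API) can verify the chosen delimiter before writing:

```python
# Hypothetical pre-flight check before to_csv: confirm the chosen column
# delimiter never appears in the data values themselves.
def delimiter_is_safe(values, col_delim: str = ",") -> bool:
    return all(col_delim not in v for v in values)

print(delimiter_is_safe(["alpha", "beta"], ","))    # True
print(delimiter_is_safe(["alpha,1", "beta"], ","))  # False
```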
- save(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str#
DEPRECATED Save the Strings object to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 supports single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – The name of the Strings dataset to be written, defaults to strings_array
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If 'append', create a new Strings dataset within existing files.
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read. This is not supported for Parquet files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
file_format (str) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.
file_type (str ("single" | "distribute")) – Default: "distribute". When "distribute", the dataset is written to one file per locale. When "single", the dataset is saved to one file.
- Return type:
String message indicating result of save operation
Notes
Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter. (3) Parquet files do not store the segments, only the values.
- info() str#
Returns a JSON formatted string containing information about all components of self
- Parameters:
None –
- Returns:
JSON string containing information about all components of self
- Return type:
str
- pretty_print_info() None#
Prints information about all components of self in a human readable format
- Parameters:
None –
- Return type:
None
- register(user_defined_name: str) Strings#
Register this Strings object with a user defined name in the arkouda server so it can be attached to later using Strings.attach(). This is an in-place operation; registering a Strings object more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one object at a time.
- Parameters:
user_defined_name (str) – user defined name which the Strings object is to be registered under
- Returns:
The same Strings object which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different objects with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Strings object with the user_defined_name If the user is attempting to register more than one object with the same name, the former should be unregistered first to free up the registration name.
See also
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- unregister() None#
Unregister a Strings object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not find the internal name/symbol to remove
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry
- Parameters:
None –
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
- static attach(user_defined_name: str) Strings#
class method to return a Strings object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which the Strings object was registered under
- Returns:
the Strings object registered with user_defined_name in the arkouda server
- Return type:
Strings object
- Raises:
TypeError – Raised if user_defined_name is not a str
See also
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- static unregister_strings_by_name(user_defined_name: str) None#
Unregister a Strings object in the arkouda server previously registered via register()
- Parameters:
user_defined_name (str) – The registered name of the Strings object
See also
- transfer(hostname: str, port: arkouda.dtypes.int_scalars)#
Sends a Strings object to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the Strings object is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). Sending uses numLocales ports in succession, i.e. the range {port..(port+numLocales)} (e.g., for an Arkouda server of 4 nodes with port 1234 passed, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- arkouda.join_on_eq_with_dt(a1: arkouda.pdarrayclass.pdarray, a2: arkouda.pdarrayclass.pdarray, t1: arkouda.pdarrayclass.pdarray, t2: arkouda.pdarrayclass.pdarray, dt: int | numpy.int64, pred: str, result_limit: int | numpy.int64 = 1000) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Performs an inner-join on equality between two integer arrays where the time-window predicate is also true
- Parameters:
a1 (pdarray, int64) – pdarray to be joined
a2 (pdarray, int64) – pdarray to be joined
t1 (pdarray) – timestamps in millis corresponding to the a1 pdarray
t2 (pdarray) – timestamps in millis corresponding to the a2 pdarray
dt (Union[int,np.int64]) – time delta
pred (str) – time window predicate
result_limit (Union[int,np.int64]) – size limit for returned result
- Returns:
result_array_one (pdarray, int64) – a1 indices where a1 == a2
result_array_two (pdarray, int64) – a2 indices where a2 == a1
- Raises:
TypeError – Raised if a1, a2, t1, or t2 is not a pdarray, or if dt or result_limit is not an int
ValueError – if a1, a2, t1, or t2 dtype is not int64, pred is not ‘true_dt’, ‘abs_dt’, or ‘pos_dt’, or result_limit is < 0
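The Raises entry above names the three accepted predicates; a pure-Python sketch of one plausible reading of them (an illustration only; the authoritative check runs server-side):

```python
# Sketch of the three time-window predicates (an interpretation for
# illustration, not the server implementation).
def window_ok(pred: str, t1: int, t2: int, dt: int) -> bool:
    if pred == "true_dt":      # no time constraint
        return True
    if pred == "abs_dt":       # |t1 - t2| within dt
        return abs(t1 - t2) <= dt
    if pred == "pos_dt":       # t2 at or after t1, within dt
        return 0 <= t2 - t1 <= dt
    raise ValueError(f"unknown predicate {pred!r}")

print(window_ok("abs_dt", 100, 95, 10))  # True
print(window_ok("pos_dt", 100, 95, 10))  # False
```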
- arkouda.gen_ranges(starts, ends, stride=1)#
Generate a segmented array of variable-length, contiguous ranges between pairs of start- and end-points.
- Parameters:
- Returns:
segments (pdarray, int64) – The starting index of each range in the resulting array
ranges (pdarray, int64) – The actual ranges, flattened into a single array
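The segments/ranges output can be pictured with a NumPy sketch (gen_ranges_sketch is a hypothetical name; end-exclusive ranges and stride=1 are assumed):

```python
import numpy as np

# NumPy sketch of gen_ranges: for each (start, end) pair emit the
# contiguous range [start, end), flattened, plus the offset at which
# each range begins in the flattened result.
def gen_ranges_sketch(starts, ends):
    lengths = [e - s for s, e in zip(starts, ends)]
    segments = np.cumsum([0] + lengths[:-1])
    ranges = np.concatenate([np.arange(s, e) for s, e in zip(starts, ends)])
    return segments, ranges

segs, rngs = gen_ranges_sketch([0, 10], [3, 13])
print(segs.tolist())  # [0, 3]
print(rngs.tolist())  # [0, 1, 2, 10, 11, 12]
```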
- arkouda.compute_join_size(a: arkouda.pdarrayclass.pdarray, b: arkouda.pdarrayclass.pdarray) Tuple[int, int]#
Compute the internal size of a hypothetical join between a and b. Returns both the number of elements and number of bytes required for the join.
- class arkouda.Categorical(values, **kwargs)#
Represents an array of values belonging to named categories. Converting a Strings object to Categorical often saves memory and speeds up operations, especially if there are many repeated values, at the cost of some one-time work in initialization.
- Parameters:
values (Strings) – String values to convert to categories
NAvalue (str scalar) – The value to use to represent missing/null data
- permutation#
The permutation that groups the values in the same order as categories
- Type:
pdarray, int64
- size#
The number of items in the array
- Type:
Union[int,np.int64]
- nlevels#
The number of distinct categories
- Type:
Union[int,np.int64]
- ndim#
The rank of the array (currently only rank 1 arrays supported)
- Type:
Union[int,np.int64]
- shape#
The sizes of each dimension of the array
- Type:
tuple
- BinOps#
- RegisterablePieces#
- RequiredPieces#
- permutation#
- segments#
- objType = 'Categorical'#
- dtype#
- classmethod from_codes(codes: arkouda.pdarrayclass.pdarray, categories: arkouda.strings.Strings, permutation=None, segments=None, **kwargs) Categorical#
Make a Categorical from codes and categories arrays. If codes and categories have already been pre-computed, this constructor saves time. If not, please use the normal constructor.
- Parameters:
- Returns:
The Categorical object created from the input parameters
- Return type:
- Raises:
TypeError – Raised if codes is not a pdarray of int64 objects or if categories is not a Strings object
- classmethod from_return_msg(rep_msg) Categorical#
Create categorical from return message from server
Notes
This is currently only used when reading a Categorical from HDF5 files.
- classmethod standardize_categories(arrays, NAvalue='N/A')#
Standardize an array of Categoricals so that they share the same categories.
- Parameters:
arrays (sequence of Categoricals) – The Categoricals to standardize
NAvalue (str scalar) – The value to use to represent missing/null data
- Returns:
A list of the original Categoricals remapped to the shared categories.
- Return type:
List of Categoricals
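The remapping idea can be sketched in pure Python (standardize_sketch is a hypothetical name; ak.Categorical keeps its data server-side, so this only illustrates the shared-category construction):

```python
# Pure-Python sketch of standardizing several "categoricals" (each a
# list of labels) onto one shared, sorted category set that includes
# the NA value; values are then re-coded against the shared set.
def standardize_sketch(arrays, na_value="N/A"):
    shared = sorted(set().union(*[set(a) for a in arrays]) | {na_value})
    code_of = {cat: i for i, cat in enumerate(shared)}
    return shared, [[code_of[v] for v in a] for a in arrays]

shared, codes = standardize_sketch([["a", "b"], ["b", "c"]])
print(shared)  # ['N/A', 'a', 'b', 'c']
print(codes)   # [[1, 2], [2, 3]]
```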
- set_categories(new_categories, NAvalue=None)#
Set categories to user-defined values.
- Parameters:
new_categories (Strings) – The array of new categories to use. Must be unique.
NAvalue (str scalar) – The value to use to represent missing/null data
- Returns:
A new Categorical with the user-defined categories. Old values present in new categories will appear unchanged. Old values not present will be assigned the NA value.
- Return type:
- to_ndarray() numpy.ndarray#
Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. This conversion discards category information and produces an ndarray of strings. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A numpy ndarray of strings corresponding to the values in this array
- Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed
ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
- to_list() List#
Convert the Categorical to a list, transferring data from the arkouda server to Python. This conversion discards category information and produces a list of strings. If the Categorical exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A list of strings corresponding to the values in this Categorical
- Return type:
list
Notes
The number of bytes in the Categorical cannot exceed
ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
- isna()#
Find where values are missing or null (as defined by self.NAvalue)
- reset_categories() Categorical#
Recompute the category labels, discarding any unused labels. This method is often useful after slicing or indexing a Categorical array, when the resulting array only contains a subset of the original categories. In this case, eliminating unused categories can speed up other operations.
- Returns:
A Categorical object generated from the current instance
- Return type:
- contains(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element contains the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that contain substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
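The speedup described in the note comes from testing each unique label once and then fanning the per-label result out through the integer codes; a NumPy sketch of that idea (the categories/codes layout mirrors how a categorical stores its data, but this runs locally for illustration):

```python
import numpy as np

# One substring test per unique label, then broadcast the per-label
# result back through the codes instead of scanning every element.
categories = np.array(["apple", "banana", "cherry"])
codes = np.array([0, 1, 1, 2, 0])               # the full (possibly huge) array
label_hit = np.char.find(categories, "an") >= 0  # one test per label
full_hit = label_hit[codes]                      # broadcast to every element
print(full_hit.tolist())  # [False, True, True, False, False]
```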
- startswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element starts with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that start with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
- endswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element ends with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that end with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
- in1d(test: arkouda.strings.Strings | Categorical) arkouda.pdarrayclass.pdarray#
Test whether each element of the Categorical object is also present in the test Strings or Categorical object.
Returns a boolean array the same length as self that is True where an element of self is in test and False otherwise.
- Parameters:
test (Union[Strings,Categorical]) – The values against which to test each value of self.
- Returns:
The values self[in1d] are in the test Strings or Categorical object.
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if test is not a Strings or Categorical object
See also
Notes
in1d can be considered as an element-wise function version of the python keyword in, for 1-D sequences.
in1d(a, b) is logically equivalent to ak.array([item in b for item in a]), but is much faster and scales to arbitrarily large a.
Examples
>>> strings = ak.array([f'String {i}' for i in range(0,5)])
>>> cat = ak.Categorical(strings)
>>> ak.in1d(cat,strings)
array([True, True, True, True, True])
>>> strings = ak.array([f'String {i}' for i in range(5,9)])
>>> catTwo = ak.Categorical(strings)
>>> ak.in1d(cat,catTwo)
array([False, False, False, False, False])
- unique() Categorical#
- hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Compute a 128-bit hash of each element of the Categorical.
- Returns:
A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.
- Return type:
Notes
The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.
- group() arkouda.pdarrayclass.pdarray#
Return the permutation that groups the array, placing equivalent categories together. All instances of the same category are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.
- Returns:
The permutation that groups the array by value
- Return type:
Notes
This method is faster than the corresponding Strings method. If the Categorical was created from a Strings object, then this function simply returns the cached permutation. Even if the Categorical was created using from_codes(), this function will be faster than Strings.group() because it sorts dense integer values, rather than 128-bit hash values.
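The note above can be pictured with a NumPy analogue: once values are dense integer codes, grouping reduces to a stable argsort of small integers (a local sketch, not the server implementation):

```python
import numpy as np

# Grouping a categorical = stable argsort of its integer codes; equal
# codes end up in one contiguous block of the permuted array.
codes = np.array([2, 0, 2, 1, 0])
perm = np.argsort(codes, kind="stable")
print(perm.tolist())         # [1, 4, 3, 0, 2]
print(codes[perm].tolist())  # [0, 0, 1, 2, 2]
```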
- argsort()#
- sort()#
- concatenate(others: Sequence[Categorical], ordered: bool = True) Categorical#
Merge this Categorical with other Categorical objects in the array, concatenating the arrays and synchronizing the categories.
- Parameters:
others (Sequence[Categorical]) – The Categorical arrays to concatenate and merge with this one
ordered (bool) – If True (default), the arrays will be appended in the order given. If False, array data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.
- Returns:
The merged Categorical object
- Return type:
- Raises:
TypeError – Raised if any others array objects are not Categorical objects
Notes
This operation can be expensive – slower than concatenating Strings.
- to_hdf(prefix_path, dataset='categorical_array', mode='truncate', file_type='distribute')#
Save the Categorical to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: "distribute". When set to single, the dataset is written to a single file. When distribute, the dataset is written to one file per locale.
- Return type:
None
See also
- update_hdf(prefix_path, dataset='categorical_array', repack=True)#
Overwrite the dataset with the name provided with this Categorical object. If the dataset does not exist it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True. HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to False will yield better performance, but will cause file sizes to expand.
- Return type:
None
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the Categorical
Notes
If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
Because HDF5 deletes do not release memory, the repack option allows for automatic creation of a file without the inaccessible data.
- to_parquet(prefix_path: str, dataset: str = 'categorical_array', mode: str = 'truncate', compression: str | None = None) str#
This functionality is currently not supported and will raise a RuntimeError; support is in development. Save the Categorical to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in HDF5 files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Categorical dataset within existing files.
compression (str (Optional)) – Default None Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4
- Return type:
String message indicating result of save operation
- Raises:
RuntimeError – Always raised, due to compatibility issues between Categorical and Parquet.
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'.
'append' write mode is supported, but is not efficient.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
See also
- save(prefix_path: str, dataset: str = 'categorical_array', file_format: str = 'HDF5', mode: str = 'truncate', file_type: str = 'distribute', compression: str | None = None) str#
DEPRECATED Save the Categorical object to HDF5 or Parquet. The result is a collection of HDF5/Parquet files, one file per locale of the arkouda server, where each filename starts with prefix_path and dataset. Each locale saves its chunk of the Strings array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in HDF5 files (must not already exist)
file_format (str {'HDF5' | 'Parquet'}) – The format to save the file to.
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If 'append', create a new Categorical dataset within existing files.
file_type (str ("single" | "distribute")) – Default: "distribute". When set to single, the dataset is written to a single file. When distribute, the dataset is written to one file per locale. This is only supported by HDF5 files and has no impact on Parquet files.
compression (str (Optional)) – {None | ‘snappy’ | ‘gzip’ | ‘brotli’ | ‘zstd’ | ‘lz4’} The compression type to use when writing. This is only supported for Parquet files and will not be used with HDF5.
- Return type:
String message indicating result of save operation
- Raises:
ValueError – Raised if the lengths of columns and values differ, or the mode is neither ‘truncate’ nor ‘append’
TypeError – Raised if prefix_path, dataset, or mode is not a str
Notes
Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter.
See also
- register(user_defined_name: str) Categorical#
Register this Categorical object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the Categorical is to be registered under, this will be the root name for underlying components
- Returns:
The same Categorical which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Categoricals with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Categorical with the user_defined_name
See also
unregister, attach, unregister_categorical_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister() None#
Unregister this Categorical object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
register, attach, unregister_categorical_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
numpy.bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mis-match of registered components
See also
register, attach, unregister, unregister_categorical_by_name
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- info() str#
Returns a JSON formatted string containing information about all components of self
- Parameters:
None –
- Returns:
JSON string containing information about all components of self
- Return type:
str
- pretty_print_info() None#
Prints information about all components of self in a human readable format
- Parameters:
None –
- Return type:
None
- static attach(user_defined_name: str) Categorical#
DEPRECATED Function to return a Categorical object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which Categorical object was registered under
- Returns:
The Categorical object created by re-attaching to the corresponding server components
- Return type:
- Raises:
TypeError – if user_defined_name is not a string
- static unregister_categorical_by_name(user_defined_name: str) None#
Function to unregister Categorical object by name which was registered with the arkouda server via register()
- Parameters:
user_defined_name (str) – Name under which the Categorical object was registered
- Raises:
TypeError – if user_defined_name is not a string
RegistrationError – if there is an issue attempting to unregister any underlying components
See also
- static parse_hdf_categoricals(d: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings]) Tuple[List[str], Dict[str, Categorical]]#
This function should be used in conjunction with the load_all function which reads hdf5 files and reconstitutes Categorical objects. Categorical objects use a naming convention and HDF5 structure so they can be identified and constructed for the user.
In general you should not call this method directly
- Parameters:
d (Dictionary of String to either Pdarray or Strings object) –
- Returns:
2-Tuple of List of strings containing key names which should be removed and Dictionary of
base name to Categorical object
See also
- transfer(hostname: str, port: arkouda.dtypes.int_scalars)#
Sends a Categorical object to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the Categorical is running.
port (int_scalars) – The port over which to send the array. This needs to be an open port (i.e., not one that the Arkouda server is running on). The transfer uses numLocales ports in succession, so ports in the range {port..(port+numLocales)} will be used (e.g., for an Arkouda server running on 4 nodes with port 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- class arkouda.LogLevel#
Bases:
enum.Enum
Generic enumeration.
Derive from this class to define new enumerations.
- DEBUG = 'DEBUG'#
- CRITICAL = 'CRITICAL'#
- INFO = 'INFO'#
- WARN = 'WARN'#
- ERROR = 'ERROR'#
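For readers unfamiliar with Python enumerations, a minimal standalone sketch (not the arkouda source) of a string-valued enum with the same members as LogLevel:

```python
from enum import Enum

class LogLevel(Enum):
    """Illustrative mirror of arkouda's LogLevel members."""
    DEBUG = 'DEBUG'
    CRITICAL = 'CRITICAL'
    INFO = 'INFO'
    WARN = 'WARN'
    ERROR = 'ERROR'

# Members are singletons that expose .name and .value,
# and can be looked up by name with LogLevel['WARN'].
level = LogLevel.INFO
print(level.name, level.value)
```

Because members compare by identity, a LogLevel argument can be validated with a simple `isinstance(logLevel, LogLevel)` check, which is consistent with the TypeError raised by disableVerbose below.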
- arkouda.enableVerbose() None#
Enables verbose logging (DEBUG log level) for all ArkoudaLoggers
- arkouda.disableVerbose(logLevel: LogLevel = LogLevel.INFO) None#
Disables verbose logging (DEBUG log level) for all ArkoudaLoggers, setting the log level for each to the logLevel parameter
- Parameters:
logLevel (LogLevel) – The new log level, defaults to LogLevel.INFO
- Raises:
TypeError – Raised if logLevel is not a LogLevel enum
- arkouda.write_log(log_msg: str, tag: str = 'ClientGeneratedLog', log_lvl: LogLevel = LogLevel.INFO)#
Allows the user to write custom logs.
- Parameters:
log_msg (str) – The message to be added to the server log
tag (str) – The tag to use in the log. This takes the place of the server function name. Allows for easy identification of custom logs. Defaults to “ClientGeneratedLog”
log_lvl (LogLevel) – The level of log to be written. Defaults to LogLevel.INFO
See also
- arkouda.int64#
- arkouda.int_scalars#
- arkouda.intTypes#
- arkouda.isSupportedInt(num)#
- arkouda.akabs(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the element-wise absolute value of the array.
- Parameters:
pda (pdarray) –
- Returns:
A pdarray containing absolute values of the input array elements
- Return type:
- Raises:
TypeError – Raised if the parameter is not a pdarray
Examples
>>> ak.abs(ak.arange(-5,-1))
array([5, 4, 3, 2])
>>> ak.abs(ak.linspace(-5,-1,5))
array([5, 4, 3, 2, 1])
- arkouda.cast(pda: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical, dt: numpy.dtype | type | str | arkouda.dtypes.BigInt, errors: ErrorMode = ErrorMode.strict) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical | Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Cast an array to another dtype.
- Parameters:
dt (np.dtype, type, or str) – The target dtype to cast values to
errors ({strict, ignore, return_validity}) –
Controls how errors are handled when casting strings to a numeric type (ignored for casts from numeric types).
strict: raise RuntimeError if any string cannot be converted
- ignore: never raise an error. Uninterpretable strings get
converted to NaN (float64), -2**63 (int64), zero (uint64 and uint8), or False (bool)
return_validity: in addition to returning the same output as “ignore”, also return a bool array indicating where the cast was successful.
- Returns:
pdarray or Strings – Array of values cast to desired dtype
[validity (pdarray(bool)]) – If errors=”return_validity” and input is Strings, a second array is returned with True where the cast succeeded and False where it failed.
Notes
The cast is performed according to Chapel’s casting rules and is NOT safe from overflows or underflows. The user must ensure that the target dtype has the precision and capacity to hold the desired result.
Examples
>>> ak.cast(ak.linspace(1.0,5.0,5), dt=ak.int64)
array([1, 2, 3, 4, 5])
>>> ak.cast(ak.arange(0,5), dt=ak.float64).dtype
dtype('float64')
>>> ak.cast(ak.arange(0,5), dt=ak.bool)
array([False, True, True, True, True])
>>> ak.cast(ak.linspace(0,4,5), dt=ak.bool)
array([False, True, True, True, True])
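The three errors modes can be illustrated with a client-side sketch in plain Python (illustrative only; the real conversion runs on the arkouda server, and cast_strings_to_int is a hypothetical helper, not part of the arkouda API):

```python
# Sketch of ak.cast's `errors` modes for string-to-int64 conversion.
SENTINEL_INT64 = -2**63  # value substituted for uninterpretable strings in 'ignore' mode

def cast_strings_to_int(strings, errors="strict"):
    values, valid = [], []
    for s in strings:
        try:
            values.append(int(s))
            valid.append(True)
        except ValueError:
            if errors == "strict":
                # strict: raise on the first string that cannot be converted
                raise RuntimeError(f"could not cast {s!r} to int64")
            # ignore / return_validity: substitute the sentinel and keep going
            values.append(SENTINEL_INT64)
            valid.append(False)
    if errors == "return_validity":
        # also report where the cast succeeded
        return values, valid
    return values
```

For example, `cast_strings_to_int(["7", "oops"], errors="return_validity")` yields `([7, -2**63], [True, False])`, mirroring the second bool array that ak.cast returns for Strings input in return_validity mode.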
- exception arkouda.RegistrationError#
Bases:
Exception
Error/Exception used when the Arkouda Server cannot register an object
- arkouda.create_pdarray(repMsg: str, max_bits=None) pdarray#
Return a pdarray instance pointing to an array created by the arkouda server. The user should not call this function directly.
- Parameters:
repMsg (str) – space-delimited string containing the pdarray name, datatype, size, dimension, shape, and itemsize
- Returns:
A pdarray with the same attributes and data as the server-side array
- Return type:
- Raises:
ValueError – If there’s an error in parsing the repMsg parameter into the six values needed to create the pdarray instance
RuntimeError – Raised if a server-side error is thrown in the process of creating the pdarray instance
- class arkouda.pdarray(name: str, mydtype: numpy.dtype | str, size: arkouda.dtypes.int_scalars, ndim: arkouda.dtypes.int_scalars, shape: Sequence[int], itemsize: arkouda.dtypes.int_scalars, max_bits: int | None = None)#
The basic arkouda array class. This class contains only the attributes of the array; the data resides on the arkouda server. When a server operation results in a new array, arkouda will create a pdarray instance that points to the array data on the server. As such, the user should not initialize pdarray instances directly.
- name#
The server-side identifier for the array
- Type:
str
- dtype#
The element type of the array
- Type:
dtype
- size#
The number of elements in the array
- Type:
int_scalars
- ndim#
The rank of the array (currently only rank 1 arrays supported)
- Type:
int_scalars
- shape#
A list or tuple containing the sizes of each dimension of the array
- Type:
Sequence[int]
- itemsize#
The size in bytes of each element
- Type:
int_scalars
- property max_bits#
- BinOps#
- OpEqOps#
- objType = 'pdarray'#
- format_other(other) str#
Attempt to cast scalar other to the element dtype of this pdarray, and print the resulting value to a string (e.g. for sending to a server command). The user should not call this function directly.
- Parameters:
other (object) – The scalar to be cast to the pdarray.dtype
- Return type:
string representation of np.dtype corresponding to the other parameter
- Raises:
TypeError – Raised if the other parameter cannot be converted to Numpy dtype
- transfer(hostname: str, port: arkouda.dtypes.int_scalars)#
Sends a pdarray to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the pdarray is running.
port (int_scalars) – The port over which to send the array. This needs to be an open port (i.e., not one that the Arkouda server is running on). The transfer uses numLocales ports in succession, so ports in the range {port..(port+numLocales)} will be used (e.g., for an Arkouda server running on 4 nodes with port 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- opeq(other, op)#
- fill(value: arkouda.dtypes.numeric_scalars) None#
Fill the array (in place) with a constant value.
- Parameters:
value (numeric_scalars) –
- Raises:
TypeError – Raised if value is not an int, int64, float, or float64
- any() numpy.bool_#
Return True iff any element of the array evaluates to True.
- all() numpy.bool_#
Return True iff all elements of the array evaluate to True.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry
- Parameters:
None –
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
Note
This will return True if the object is registered itself or as a component of another object
- info() str#
Returns a JSON formatted string containing information about all components of self
- Parameters:
None –
- Returns:
JSON string containing information about all components of self
- Return type:
str
- pretty_print_info() None#
Prints information about all components of self in a human readable format
- Parameters:
None –
- Return type:
None
- is_sorted() numpy.bool_#
Return True iff the array is monotonically non-decreasing.
- Parameters:
None –
- Returns:
Indicates if the array is monotonically non-decreasing
- Return type:
bool
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- sum() arkouda.dtypes.numeric_and_bool_scalars#
Return the sum of all elements in the array.
- prod() numpy.float64#
Return the product of all elements in the array. Return value is always a np.float64 or np.int64.
- min() arkouda.dtypes.numpy_scalars#
Return the minimum value of the array.
- max() arkouda.dtypes.numpy_scalars#
Return the maximum value of the array.
- argmin() numpy.int64 | numpy.uint64#
Return the index of the first occurrence of the array min value
- argmax() numpy.int64 | numpy.uint64#
Return the index of the first occurrence of the array max value.
- mean() numpy.float64#
Return the mean of the array.
- var(ddof: arkouda.dtypes.int_scalars = 0) numpy.float64#
Compute the variance. See arkouda.var for details.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
- Returns:
The scalar variance of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
ValueError – Raised if the ddof >= pdarray size
RuntimeError – Raised if there’s a server-side error thrown
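The role of ddof can be seen in a plain-Python sketch of the variance formula (illustrative only; the actual computation is performed server-side):

```python
def variance(xs, ddof=0):
    """Sketch of var with Delta Degrees of Freedom.

    var = sum((x - mean)^2) / (n - ddof); ddof=0 gives the population
    variance, ddof=1 the unbiased sample variance.
    """
    n = len(xs)
    if ddof >= n:
        # mirrors the ValueError raised when ddof >= pdarray size
        raise ValueError("ddof must be less than the array size")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - ddof)
```

For `[1, 2, 3, 4]` this gives 1.25 with ddof=0 and 5/3 with ddof=1; std (below) is simply the square root of this quantity.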
- std(ddof: arkouda.dtypes.int_scalars = 0) numpy.float64#
Compute the standard deviation. See arkouda.std for details.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
The scalar standard deviation of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- cov(y: pdarray) numpy.float64#
Compute the covariance between self and y.
- Parameters:
y (pdarray) – Other pdarray used to calculate covariance
- Returns:
The scalar covariance of the two arrays
- Return type:
np.float64
- Raises:
TypeError – Raised if y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- corr(y: pdarray) numpy.float64#
Compute the correlation between self and y using pearson correlation coefficient.
- Parameters:
y (pdarray) – Other pdarray used to calculate correlation
- Returns:
The scalar correlation of the two arrays
- Return type:
np.float64
- Raises:
TypeError – Raised if y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
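A plain-Python sketch of the Pearson coefficient (illustrative only, using population (ddof=0) moments throughout; with a consistent normalization the 1/n factors cancel between numerator and denominator):

```python
import math

def pearson_corr(x, y):
    """corr(x, y) = cov(x, y) / (std(x) * std(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)
```

Perfectly linearly related arrays give +1 (same direction) or -1 (opposite direction), which is a quick sanity check for the server-side result.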
- mink(k: arkouda.dtypes.int_scalars) pdarray#
Compute the minimum “k” values.
- Parameters:
k (int_scalars) – The desired count of minimum values to be returned by the output.
- Returns:
The minimum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- maxk(k: arkouda.dtypes.int_scalars) pdarray#
Compute the maximum “k” values.
- Parameters:
k (int_scalars) – The desired count of maximum values to be returned by the output.
- Returns:
The maximum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- argmink(k: arkouda.dtypes.int_scalars) pdarray#
Finds the indices corresponding to the minimum “k” values.
- Parameters:
k (int_scalars) – The desired count of minimum values to be returned by the output.
- Returns:
Indices corresponding to the minimum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- argmaxk(k: arkouda.dtypes.int_scalars) pdarray#
Finds the indices corresponding to the maximum “k” values.
- Parameters:
k (int_scalars) – The desired count of maximum values to be returned by the output.
- Returns:
Indices corresponding to the maximum k values, sorted
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- value_counts()#
Count the occurrences of the unique values of self.
- Returns:
unique_values (pdarray) – The unique values, sorted in ascending order
counts (pdarray, int64) – The number of times the corresponding unique value occurs
Examples
>>> ak.array([2, 0, 2, 4, 0, 0]).value_counts()
(array([0, 2, 4]), array([3, 2, 1]))
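The semantics match this client-side sketch in plain Python (the real computation is distributed on the server):

```python
from collections import Counter

def value_counts(xs):
    """Return (unique values sorted ascending, matching occurrence counts)."""
    counts = Counter(xs)
    uniques = sorted(counts)
    return uniques, [counts[u] for u in uniques]
```

For `[2, 0, 2, 4, 0, 0]` this yields `([0, 2, 4], [3, 2, 1])`, matching the doctest above.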
- astype(dtype) pdarray#
Cast values of pdarray to provided dtype
- Parameters:
dtype (np.dtype or str) – Dtype to cast to
- Returns:
An arkouda pdarray with values converted to the specified data type
- Return type:
ak.pdarray
Notes
This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.
- slice_bits(low, high) pdarray#
Returns a pdarray containing only bits from low to high of self.
This is zero indexed and inclusive on both ends, so slicing the bottom 64 bits is pda.slice_bits(0, 63)
- Parameters:
low (int) – The lowest bit included in the slice (inclusive) zero indexed, so the first bit is 0
high (int) – The highest bit included in the slice (inclusive)
- Returns:
A new pdarray containing the bits of self from low to high
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> p = ak.array([2**65 + (2**64 - 1)])
>>> bin(p[0])
'0b101111111111111111111111111111111111111111111111111111111111111111'
>>> bin(p.slice_bits(64, 65)[0])
'0b10'
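The bit arithmetic behind slice_bits can be sketched for a single Python int (a sketch of the semantics, not the server implementation):

```python
def slice_bits(x, low, high):
    """Keep bits low..high of x, inclusive on both ends, zero-indexed
    from the least significant bit."""
    width = high - low + 1
    # shift the low bit down to position 0, then mask off `width` bits
    return (x >> low) & ((1 << width) - 1)
```

With `p = 2**65 + (2**64 - 1)`, `slice_bits(p, 64, 65)` is `0b10` and `slice_bits(p, 0, 63)` recovers the bottom 64 bits, matching the doctest above.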
- bigint_to_uint_arrays() List[pdarray]#
Creates a list of uint pdarrays from a bigint pdarray. The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.
- Returns:
A list of uint pdarrays where: The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.
- Return type:
List[pdarray]
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> a = ak.arange(2**64, 2**64 + 5)
>>> a
array(["18446744073709551616" "18446744073709551617" "18446744073709551618" "18446744073709551619" "18446744073709551620"])
>>> a.bigint_to_uint_arrays()
[array([1 1 1 1 1]), array([0 1 2 3 4])]
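The limb decomposition can be sketched for a single Python int (most significant 64-bit limb first; a sketch of the semantics, not the server implementation):

```python
def bigint_to_uint_limbs(x):
    """Split a non-negative Python int into 64-bit limbs,
    highest 64 bits first, lowest 64 bits last."""
    limbs = []
    while True:
        limbs.append(x & (2**64 - 1))  # take the lowest 64 bits
        x >>= 64
        if x == 0:
            break
    return limbs[::-1]  # reverse so the most significant limb comes first
```

Applied element-wise to `range(2**64, 2**64 + 5)`, every value splits into a high limb of 1 and low limbs 0 through 4, matching the doctest above.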
- reshape(*shape, order='row_major')#
Gives a new shape to an array without changing its data.
- Parameters:
shape (int, tuple of ints, or pdarray) – The new shape should be compatible with the original shape.
order (str {'row_major' | 'C' | 'column_major' | 'F'}) – Read the elements of the pdarray in this index order. By default, read the elements in row_major or C-like order, where the last index changes the fastest. If ‘column_major’ or ‘F’, read the elements in column_major or Fortran-like order, where the first index changes the fastest.
- Returns:
An ArrayView object with the data from the array but with the new shape
- Return type:
- to_ndarray() numpy.ndarray#
Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same attributes and data as the pdarray
- Return type:
np.ndarray
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed
client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
numpy.ndarray
- to_list() List#
Convert the array to a list, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A list with the same data as the pdarray
- Return type:
list
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed
client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_list()
[0, 1, 2, 3, 4]
>>> type(a.to_list())
list
- to_cuda()#
Convert the array to a Numba DeviceNDArray, transferring array data from the arkouda server to Python via ndarray. If the array exceeds a builtin size limit, a RuntimeError is raised.
- Returns:
A Numba ndarray with the same attributes and data as the pdarray; on GPU
- Return type:
numba.DeviceNDArray
- Raises:
ImportError – Raised if CUDA is not available
ModuleNotFoundError – Raised if Numba is either not installed or not enabled
RuntimeError – Raised if there is a server-side error thrown in the course of retrieving the pdarray.
Notes
The number of bytes in the array cannot exceed
client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_cuda()
array([0, 1, 2, 3, 4])
>>> type(a.to_cuda())
numba.DeviceNDArray
- to_parquet(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None) str#
Save the pdarray to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’.
- ‘append’ write mode is supported, but is not efficient.
- If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (Parquet)
>>> a.to_parquet('path/prefix.parquet', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
- to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute') str#
Save the pdarray to HDF5. The object can be saved to a collection of files or a single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”. When set to single, the dataset is written to a single file. When distribute, the dataset is written to a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path.
- If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')
Saves the array to a single HDF5 file on the root node: ``cwd/path/name_prefix.hdf5``
- update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)#
Overwrite the dataset with the name provided with this pdarray. If the dataset does not exist, it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
If the file does not contain a File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
- to_csv(prefix_path: str, dataset: str = 'array', col_delim: str = ',', overwrite: bool = False)#
Write pdarray to CSV file(s). File will contain a single column with the pdarray data. All CSV Files written by Arkouda include a header denoting data types of the columns.
- Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
dataset (str) – Column name to save the pdarray under. Defaults to “array”.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
str response message
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true, this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations.
The column delimiter is expected to be the same for column names and data.
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (`\n`) at this time.
- save(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str#
DEPRECATED Save the pdarray to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 supports single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.
file_type (str ("single" | "distribute")) – Default: “distribute”. When set to single, the dataset is written to a single file. When distribute, the dataset is written to a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append
TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string
See also
save_all, load, read, to_parquet, to_hdf
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission. Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. Previously all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found. Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.save('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.save('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving with an extension (Parquet)
>>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet')
Saves the array in numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
- register(user_defined_name: str) pdarray#
Register this pdarray with a user defined name in the arkouda server so it can be attached to later using pdarray.attach(). This is an in-place operation; registering a pdarray more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one pdarray at a time.
- Parameters:
user_defined_name (str) – user defined name array is to be registered under
- Returns:
The same pdarray which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different pdarrays with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the pdarray with the user_defined_name. If the user is attempting to register more than one pdarray with the same name, the former should be unregistered first to free up the registration name.
See also
attach, unregister, is_registered, list_registry, unregister_pdarray_by_name
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
- unregister() None#
Unregister a pdarray in the arkouda server which was previously registered using register() and/or attached to using attach()
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not find the internal name/symbol to remove
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
- static attach(user_defined_name: str) pdarray#
class method to return a pdarray attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which array was registered under
- Returns:
pdarray which is bound to the corresponding server side component which was registered with user_defined_name
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
- arkouda.from_series(series: pandas.Series, dtype: type | str | None = None) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings#
Converts a Pandas Series to an Arkouda pdarray or Strings object. If dtype is None, the dtype is inferred from the Pandas Series. Otherwise, set the dtype parameter to override the Series dtype or to specify it when it is unknown (for example, when the Series dtype is object).
- Parameters:
series (Pandas Series) – The Pandas Series with a dtype of bool, float64, int64, or string
dtype (Optional[type]) – The valid dtype types are np.bool, np.float64, np.int64, and np.str
- Return type:
- Raises:
TypeError – Raised if series is not a Pandas Series object
ValueError – Raised if the Series dtype is not bool, float64, int64, string, datetime, or timedelta
Examples
>>> ak.from_series(pd.Series(np.random.randint(0,10,5)))
array([9, 0, 4, 7, 9])
>>> ak.from_series(pd.Series(['1', '2', '3', '4', '5']), dtype=np.int64)
array([1, 2, 3, 4, 5])
>>> ak.from_series(pd.Series(np.random.uniform(low=0.0,high=1.0,size=3)))
array([0.57600036956445599, 0.41619265571741659, 0.6615356693784662])
>>> ak.from_series(pd.Series(['0.57600036956445599', '0.41619265571741659', '0.6615356693784662']), dtype=np.float64)
array([0.57600036956445599, 0.41619265571741659, 0.6615356693784662])
>>> ak.from_series(pd.Series(np.random.choice([True, False],size=5)))
array([True, False, True, True, True])
>>> ak.from_series(pd.Series(['True', 'False', 'False', 'True', 'True']), dtype=np.bool)
array([True, True, True, True, True])
>>> ak.from_series(pd.Series(['a', 'b', 'c', 'd', 'e'], dtype="string"))
array(['a', 'b', 'c', 'd', 'e'])
>>> ak.from_series(pd.Series(['a', 'b', 'c', 'd', 'e']), dtype=np.str)
array(['a', 'b', 'c', 'd', 'e'])
>>> ak.from_series(pd.Series(pd.to_datetime(['1/1/2018', np.datetime64('2018-01-01')])))
array([1514764800000000000, 1514764800000000000])
Notes
The supported datatypes are bool, float64, int64, string, and datetime64[ns]. The data type is either inferred from the Series or is set via the dtype parameter.
Series of datetime or timedelta are converted to Arkouda arrays of dtype int64 (nanoseconds).
A Pandas Series containing strings has a dtype of object. Arkouda assumes the Series contains strings and sets the dtype to str.
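Since datetime Series are stored as int64 nanoseconds, the conversion can be previewed with plain pandas. The sketch below is illustrative only; ak.from_series itself runs against the arkouda server:

```python
import numpy as np
import pandas as pd

# from_series stores datetime64[ns] data as int64 nanoseconds.
# The same conversion in plain pandas (client-side analog, not arkouda):
s = pd.Series(pd.to_datetime(["1/1/2018", np.datetime64("2018-01-01")]))
ns = s.astype(np.int64)
print(ns.tolist())  # [1514764800000000000, 1514764800000000000]
```

The values match the array shown in the final example above.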
- class arkouda.Datetime(pda, unit: str = _BASE_UNIT)#
Bases:
_AbstractBaseTime
Represents a date and/or time.
Datetime is the Arkouda analog to pandas DatetimeIndex and other timeseries data types.
- Parameters:
pda (int64 pdarray, pd.DatetimeIndex, pd.Series, or np.datetime64 array) –
unit (str, default 'ns') –
For int64 pdarray, denotes the unit of the input. Ignored for pandas and numpy arrays, which carry their own unit. Not case-sensitive; prefixes of full names (like ‘sec’) are accepted.
Possible values:
‘weeks’ or ‘w’
‘days’ or ‘d’
‘hours’ or ‘h’
‘minutes’, ‘m’, or ‘t’
‘seconds’ or ‘s’
‘milliseconds’, ‘ms’, or ‘l’
‘microseconds’, ‘us’, or ‘u’
‘nanoseconds’, ‘ns’, or ‘n’
Unlike in pandas, units cannot be combined or mixed with integers.
Notes
The .values attribute is always in nanoseconds with int64 dtype.
- property nanosecond#
- property microsecond#
- property millisecond#
- property second#
- property minute#
- property hour#
- property day#
- property month#
- property year#
- property day_of_year#
- property dayofyear#
- property day_of_week#
- property dayofweek#
- property weekday#
- property week#
- property weekofyear#
- property date#
- property is_leap_year#
- supported_with_datetime#
- supported_with_r_datetime#
- supported_with_timedelta#
- supported_with_r_timedelta#
- supported_opeq#
- supported_with_pdarray#
- supported_with_r_pdarray#
- special_objType = 'Datetime'#
- isocalendar()#
- to_pandas()#
Convert array to a pandas DatetimeIndex. Note: if the array size exceeds client.maxTransferBytes, a RuntimeError is raised.
See also
to_ndarray
- sum()#
Return the sum of all elements in the array.
- register(user_defined_name)#
Register this Datetime object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the Datetime is to be registered under; this will be the root name for underlying components
- Returns:
The same Datetime, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluid programming style. Please note you cannot register two different Datetimes with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – Raised if the server was unable to register the Datetime with the user_defined_name
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister()#
Unregister this Datetime object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
numpy.bool_
- Raises:
RegistrationError – Raised if there’s a server-side error or a mismatch of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- class arkouda.Timedelta(pda, unit: str = _BASE_UNIT)#
Bases:
_AbstractBaseTime
Represents a duration, the difference between two dates or times.
Timedelta is the Arkouda equivalent of pandas.TimedeltaIndex.
- Parameters:
pda (int64 pdarray, pd.TimedeltaIndex, pd.Series, or np.timedelta64 array) –
unit (str, default 'ns') –
For int64 pdarray, denotes the unit of the input. Ignored for pandas and numpy arrays, which carry their own unit. Not case-sensitive; prefixes of full names (like ‘sec’) are accepted.
Possible values:
‘weeks’ or ‘w’
‘days’ or ‘d’
‘hours’ or ‘h’
‘minutes’, ‘m’, or ‘t’
‘seconds’ or ‘s’
‘milliseconds’, ‘ms’, or ‘l’
‘microseconds’, ‘us’, or ‘u’
‘nanoseconds’, ‘ns’, or ‘n’
Unlike in pandas, units cannot be combined or mixed with integers.
Notes
The .values attribute is always in nanoseconds with int64 dtype.
- property nanoseconds#
- property microseconds#
- property seconds#
- property days#
- property components#
- supported_with_datetime#
- supported_with_r_datetime#
- supported_with_timedelta#
- supported_with_r_timedelta#
- supported_opeq#
- supported_with_pdarray#
- supported_with_r_pdarray#
- special_objType = 'Timedelta'#
- total_seconds()#
- to_pandas()#
Convert array to a pandas TimedeltaIndex. Note: if the array size exceeds client.maxTransferBytes, a RuntimeError is raised.
See also
to_ndarray
- std(ddof: arkouda.dtypes.int_scalars = 0)#
Returns the standard deviation as a pd.Timedelta object
- sum()#
Return the sum of all elements in the array.
- abs()#
Absolute value of time interval.
- register(user_defined_name)#
Register this Timedelta object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the Timedelta is to be registered under; this will be the root name for underlying components
- Returns:
The same Timedelta, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluid programming style. Please note you cannot register two different Timedeltas with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – Raised if the server was unable to register the Timedelta with the user_defined_name
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister()#
Unregister this Timedelta object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
numpy.bool_
- Raises:
RegistrationError – Raised if there’s a server-side error or a mismatch of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- arkouda.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, inclusive='both', **kwargs)#
Creates a fixed frequency Datetime range. Alias for ak.Datetime(pd.date_range(args)). Subject to the size limit imposed by client.maxTransferBytes.
- Parameters:
start (str or datetime-like, optional) – Left bound for generating dates.
end (str or datetime-like, optional) – Right bound for generating dates.
periods (int, optional) – Number of periods to generate.
freq (str or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’. See timeseries.offset_aliases for a list of frequency aliases.
tz (str or tzinfo, optional) – Time zone name for returning localized DatetimeIndex, for example ‘Asia/Hong_Kong’. By default, the resulting DatetimeIndex is timezone-naive.
normalize (bool, default False) – Normalize start/end dates to midnight before generating date range.
name (str, default None) – Name of the resulting DatetimeIndex.
closed ({None, 'left', 'right'}, optional) – Make the interval closed with respect to the given frequency on the ‘left’, ‘right’, or both sides (None, the default). Deprecated.
inclusive ({"both", "neither", "left", "right"}, default "both") – Include boundaries. Whether to set each bound as closed or open.
**kwargs – For compatibility. Has no effect on the result.
- Returns:
rng
- Return type:
DatetimeIndex
Notes
Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omitted, the resulting DatetimeIndex will have periods linearly spaced elements between start and end (closed on both sides).
To learn more about the frequency strings, please see this link.
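Because ak.date_range is documented above as an alias for ak.Datetime(pd.date_range(args)), the three-of-four parameter rule and the generated timestamps can be previewed with pandas alone. This sketch is illustrative and does not require an arkouda server:

```python
import pandas as pd

# With freq omitted, pd.date_range produces `periods` linearly spaced
# timestamps between start and end (closed on both sides); ak.date_range
# wraps this result in an ak.Datetime.
rng = pd.date_range(start="2021-01-01", end="2021-01-05", periods=5)
print(len(rng))            # 5
print(rng[0].isoformat())  # 2021-01-01T00:00:00
print(rng[-1].isoformat()) # 2021-01-05T00:00:00
```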
- arkouda.timedelta_range(start=None, end=None, periods=None, freq=None, name=None, closed=None, **kwargs)#
Return a fixed frequency TimedeltaIndex, with day as the default frequency. Alias for ak.Timedelta(pd.timedelta_range(args)). Subject to the size limit imposed by client.maxTransferBytes.
- Parameters:
start (str or timedelta-like, default None) – Left bound for generating timedeltas.
end (str or timedelta-like, default None) – Right bound for generating timedeltas.
periods (int, default None) – Number of periods to generate.
freq (str or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’.
name (str, default None) – Name of the resulting TimedeltaIndex.
closed (str, default None) – Make the interval closed with respect to the given frequency to the ‘left’, ‘right’, or both sides (None).
- Returns:
rng
- Return type:
TimedeltaIndex
Notes
Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omitted, the resulting TimedeltaIndex will have periods linearly spaced elements between start and end (closed on both sides).
To learn more about the frequency strings, please see this link.
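As with date_range, ak.timedelta_range is documented above as an alias for ak.Timedelta(pd.timedelta_range(args)), so the pandas call shows what will be generated. A server-free illustrative sketch:

```python
import pandas as pd

# pd.timedelta_range with a daily frequency; ak.timedelta_range wraps
# this result in an ak.Timedelta.
rng = pd.timedelta_range(start="1 day", periods=4, freq="D")
print(list(rng.days))  # [1, 2, 3, 4]
```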
- arkouda.AllSymbols = '__AllSymbols__'#
- arkouda.RegisteredSymbols = '__RegisteredSymbols__'#
- arkouda.information(names: List[str] | str = RegisteredSymbols) str#
Returns a JSON-formatted string containing information about the objects in names.
- Parameters:
names (Union[List[str], str]) – The name of an object or a list of names of objects for which to retrieve info. If names is ak.AllSymbols, retrieves info for all symbols in the symbol table; if names is ak.RegisteredSymbols, retrieves info for all symbols in the registry.
- Returns:
JSON-formatted string containing a list of information for each object in names
- Return type:
str
- Raises:
RuntimeError – Raised if a server-side error is thrown in the process of retrieving information about the objects in names
- arkouda.list_registry(detailed: bool = False)#
Return the names of all registered objects.
- Parameters:
detailed (bool) – Default = False. If True, return details of registry objects; currently includes the object type for each object.
- Returns:
Dict containing keys “Components” and “Objects”.
- Return type:
dict
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
- arkouda.list_symbol_table() List[str]#
Return a list containing the names of all objects in the symbol table
- Parameters:
None –
- Returns:
List of all object names in the symbol table
- Return type:
list
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
- arkouda.pretty_print_information(names: List[str] | str = RegisteredSymbols) None#
Prints verbose information for each object in names in a human-readable format.
- Parameters:
names (Union[List[str], str]) – The name of an object or a list of names of objects for which to retrieve info. If names is ak.AllSymbols, retrieves info for all symbols in the symbol table; if names is ak.RegisteredSymbols, retrieves info for all symbols in the registry.
- Return type:
None
- Raises:
RuntimeError – Raised if a server-side error is thrown in the process of retrieving information about the objects in names
- arkouda.akbool#
- arkouda.akint64#
- arkouda.isSupportedInt(num)#
- arkouda.str_#
- arkouda.int_scalars#
- arkouda.akuint64#
- class arkouda.GroupBy(keys: groupable | None = None, assume_sorted: bool = False, **kwargs)#
Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.
- Parameters:
keys ((list of) pdarray, Strings, or Categorical) – The array to group by value, or if list, the column arrays to group by row
assume_sorted (bool) – If True, assume keys is already sorted (Default: False)
- nkeys#
The number of key arrays (columns)
- Type:
int
- size#
The length of the input array(s), i.e. number of rows
- Type:
int
- unique_keys#
The unique values of the keys array(s), in grouped order
- Type:
(list of) pdarray, Strings, or Categorical
- ngroups#
The length of the unique_keys array(s), i.e. number of groups
- Type:
int
- logger#
Used for all logging operations
- Type:
ArkoudaLogger
- Raises:
TypeError – Raised if keys is a pdarray with a dtype other than int64
Notes
Integral pdarrays, Strings, and Categoricals are natively supported, but float64 and bool arrays are not.
For a user-defined class to be groupable, it must inherit from pdarray and define or overload the grouping API:
a ._get_grouping_keys() method that returns a list of pdarrays that can be (co)argsorted.
(Optional) a .group() method that returns the permutation that groups the array
If the input is a single array with a .group() method defined, method 2 will be used; otherwise, method 1 will be used.
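The grouping machinery described above can be sketched client-side: the permutation that groups an array by value is a stable argsort of its keys. The numpy code below is an illustrative analog only; arkouda computes the distributed equivalent on the server:

```python
import numpy as np

# A stable argsort of the keys is the permutation that groups by value,
# which is what GroupBy's grouping API produces (here in numpy, not arkouda).
keys = np.array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
perm = np.argsort(keys, kind="stable")   # the grouping permutation
grouped = keys[perm]                     # keys in grouped order
unique_keys, counts = np.unique(keys, return_counts=True)
print(unique_keys.tolist())  # [1, 2, 3, 4]
print(counts.tolist())       # [1, 2, 4, 3]
```

This reproduces the unique_keys and counts shown in the size()/count() examples below.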
- Reductions#
- objType = 'GroupBy'#
- static from_return_msg(rep_msg)#
- to_hdf(prefix_path, dataset='groupby', mode='truncate', file_type='distribute')#
Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”. When set to single, the dataset is written to a single file. When distribute, the dataset is written to one file per locale. This is only supported by HDF5 files and has no effect on Parquet files.
- Returns:
None
GroupBy is not currently supported by Parquet
- update_hdf(prefix_path: str, dataset: str = 'groupby', repack: bool = True)#
- size() Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Count the number of elements in each group, i.e. the number of times each key appears.
- Parameters:
none –
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
counts (pdarray, int64) – The number of times each unique key appears
See also
Notes
This alias for “count” was added to conform to the Pandas API.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
- count() Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Count the number of elements in each group, i.e. the number of times each key appears.
- Parameters:
none –
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
counts (pdarray, int64) – The number of times each unique key appears
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
- aggregate(values: groupable, operator: str, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, groupable]#
Using the permutation stored in the GroupBy instance, group another array of values and apply a reduction to each group’s values.
- Parameters:
values (pdarray) – The values to group and reduce
operator (str) – The name of the reduction operator to use
skipna (bool) – Whether NaN values should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
unique_keys (groupable) – The unique keys, in grouped order
aggregates (groupable) – One aggregate value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if the requested operator is not supported for the values dtype
Examples
>>> keys = ak.arange(0, 10)
>>> vals = ak.linspace(-1, 1, 10)
>>> g = ak.GroupBy(keys)
>>> g.aggregate(vals, 'sum')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777768, -0.55555555555555536, -0.33333333333333348, -0.11111111111111116, 0.11111111111111116, 0.33333333333333348, 0.55555555555555536, 0.77777777777777768, 1]))
>>> g.aggregate(vals, 'min')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777779, -0.55555555555555558, -0.33333333333333337, -0.11111111111111116, 0.11111111111111116, 0.33333333333333326, 0.55555555555555536, 0.77777777777777768, 1]))
- sum(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and sum each group’s values.
- Parameters:
values (pdarray) – The values to group and sum
skipna (bool) – Whether NaN values should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_sums (pdarray) – One sum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The grouped sum of a boolean pdarray returns integers.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.sum(b)
(array([2, 3, 4]), array([8, 14, 6]))
- prod(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the product of each group’s values.
- Parameters:
values (pdarray) – The values to group and multiply
skipna (bool) – Whether NaN values should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_products (pdarray, float64) – One product per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if prod is not supported for the values dtype
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.prod(b)
(array([2, 3, 4]), array([12, 108.00000000000003, 8.9999999999999982]))
- var(values: arkouda.pdarrayclass.pdarray, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the variance of each group’s values.
- Parameters:
values (pdarray) – The values to group and find variance
skipna (bool) – Whether NaN values should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2).
The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.var(b)
(array([2 3 4]), array([2.333333333333333 1.2 0]))
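The effect of the ddof divisor described in the notes above can be checked against numpy. This is an illustrative client-side sketch, not an arkouda call:

```python
import numpy as np

# ddof controls the divisor N - ddof applied to the sum of squared
# deviations; ddof=1 gives the unbiased estimator, ddof=0 the ML estimate.
x = np.array([3.0, 3.0, 4.0, 1.0])
n = len(x)
var_biased = ((x - x.mean()) ** 2).sum() / n          # ddof=0
var_unbiased = ((x - x.mean()) ** 2).sum() / (n - 1)  # ddof=1
print(np.isclose(var_biased, x.var(ddof=0)))    # True
print(np.isclose(var_unbiased, x.var(ddof=1)))  # True
```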
- std(values: arkouda.pdarrayclass.pdarray, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the standard deviation of each group’s values.
- Parameters:
values (pdarray) – The values to group and find standard deviation
skipna (bool) – Whether NaN values should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)).
The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.std(b)
(array([2 3 4]), array([1.5275252316519465 1.0954451150103321 0]))
- mean(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the mean of each group’s values.
- Parameters:
values (pdarray) – The values to group and average
skipna (bool) – Whether NaN values should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_means (pdarray, float64) – One mean value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.mean(b)
(array([2, 3, 4]), array([2.6666666666666665, 2.7999999999999998, 3]))
- median(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the median of each group’s values.
- Parameters:
values (pdarray) – The values to group and find median
skipna (bool) – Whether NaN values should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_medians (pdarray, float64) – One median value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,9)
>>> a
array([4 1 4 3 2 2 2 3 3])
>>> g = ak.GroupBy(a)
>>> g.keys
array([4 1 4 3 2 2 2 3 3])
>>> b = ak.linspace(-5,5,9)
>>> b
array([-5 -3.75 -2.5 -1.25 0 1.25 2.5 3.75 5])
>>> g.median(b)
(array([1 2 3 4]), array([-3.75 1.25 3.75 -3.75]))
- min(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the minimum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find minima
skipna (bool) – Whether NaN values should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_minima (pdarray) – One minimum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if min is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if min is not supported for the values dtype
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.min(b)
(array([2, 3, 4]), array([1, 1, 3]))
- max(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the maximum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find maxima
skipna (bool) – Whether NaN values should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_maxima (pdarray) – One maximum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if max is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if max is not supported for the values dtype
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.max(b)
(array([2, 3, 4]), array([4, 4, 3]))
- argmin(values: arkouda.pdarrayclass.pdarray) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first minimum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find argmin
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if argmin is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if argmin is not supported for the values dtype
Notes
The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmin(b)
(array([2, 3, 4]), array([5, 4, 2]))
- argmax(values: arkouda.pdarrayclass.pdarray) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first maximum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find argmax
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmax(b)
(array([2, 3, 4]), array([9, 3, 2]))
- nunique(values: groupable) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the number of unique values in each group.
- Parameters:
values (pdarray, int64) – The values to group and find unique values
- Returns:
unique_keys (groupable) – The unique keys, in grouped order
group_nunique (groupable) – Number of unique values per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the dtype(s) of values array(s) does/do not support the nunique method
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if nunique is not supported for the values dtype
Examples
>>> data = ak.array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> data
array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> labels = ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> labels
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g = ak.GroupBy(labels)
>>> g.keys
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g.nunique(data)
(array([1, 2, 3, 4]), array([2, 2, 3, 1]))
# Group (1,1,1) has values [3,4,3] -> 2 unique values 3&4
# Group (2,2,2) has values [1,1,4] -> 2 unique values 1&4
# Group (3,3,3) has values [3,4,1] -> 3 unique values
# Group (4) has values [4] -> 1 unique value
- any(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and perform an “or” reduction on each group.
- Parameters:
values (pdarray, bool) – The values to group and reduce with “or”
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_any (pdarray, bool) – One bool per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
- all(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and perform an “and” reduction on each group.
- Parameters:
values (pdarray, bool) – The values to group and reduce with “and”
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_all (pdarray, bool) – One bool per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if all is not supported for the values dtype
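For intuition, the per-group "or"/"and" reduction semantics of any and all can be sketched client-side with NumPy. This is an illustrative sketch only; Arkouda performs the equivalent reduction on the server without transferring data.

```python
import numpy as np

def group_any_all(keys, values):
    """Per-group 'or' and 'and' reductions over a boolean values array.

    NumPy sketch of the GroupBy.any/GroupBy.all semantics described above.
    """
    unique_keys = np.unique(keys)
    group_any = np.array([values[keys == k].any() for k in unique_keys])
    group_all = np.array([values[keys == k].all() for k in unique_keys])
    return unique_keys, group_any, group_all

keys = np.array([1, 1, 2, 2, 3])
vals = np.array([True, False, False, False, True])
uk, g_any, g_all = group_any_all(keys, vals)
# uk -> [1 2 3]; g_any -> [True False True]; g_all -> [False False True]
```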
- OR(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise OR of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise OR reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with OR
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise OR of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if the OR reduction is not supported for the values dtype
- AND(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise AND of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise AND reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with AND
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise AND of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if the AND reduction is not supported for the values dtype
- XOR(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise XOR of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise XOR reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with XOR
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise XOR of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if the XOR reduction is not supported for the values dtype
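The three bitwise segment reductions (OR, AND, XOR) share one shape: fold the operator over each group's values. A NumPy/stdlib sketch of that semantics (illustration only; Arkouda runs the reduction server-side):

```python
import operator
from functools import reduce

import numpy as np

def group_bitwise(keys, values, op):
    """Per-group bitwise reduction over integer values.

    Sketch of GroupBy.OR/AND/XOR: fold `op` over each group's values.
    """
    unique_keys = np.unique(keys)
    out = np.array([reduce(op, values[keys == k]) for k in unique_keys])
    return unique_keys, out

keys = np.array([0, 0, 1, 1])
vals = np.array([0b0101, 0b0011, 0b1100, 0b1010])
_, ors  = group_bitwise(keys, vals, operator.or_)   # [0b0111, 0b1110]
_, ands = group_bitwise(keys, vals, operator.and_)  # [0b0001, 0b1000]
_, xors = group_bitwise(keys, vals, operator.xor)   # [0b0110, 0b0110]
```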
- first(values: groupable_element_type) Tuple[groupable, groupable_element_type]#
First value in each group.
- Parameters:
values (pdarray-like) – The values from which to take the first of each group
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The first value of each group
- mode(values: groupable) Tuple[groupable, groupable]#
Most common value in each group. If a group is multi-modal, return the modal value that occurs first.
- Parameters:
values ((list of) pdarray-like) – The values from which to take the mode of each group
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result ((list of) pdarray-like) – The most common value of each group
- unique(values: groupable)#
Return the set of unique values in each group, as a SegArray.
- Parameters:
values ((list of) pdarray-like) – The values to unique
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result ((list of) SegArray) – The unique values of each group
- Raises:
TypeError – Raised if values is, or contains, a Strings or Categorical object
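The first, mode, and unique aggregations above can be sketched in pure Python to show what each returns per group. This is a client-side illustration under the stated semantics (mode returns the first-occurring modal value), not the server implementation.

```python
from collections import Counter

def group_first_mode_unique(keys, values):
    """Per-group first value, mode, and unique-value set.

    Sketch of GroupBy.first/mode/unique. Counter.most_common breaks ties
    by first-seen order, matching the 'modal value that occurs first' rule.
    """
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    unique_keys = sorted(groups)
    firsts = [groups[k][0] for k in unique_keys]
    modes = [Counter(groups[k]).most_common(1)[0][0] for k in unique_keys]
    uniques = [sorted(set(groups[k])) for k in unique_keys]
    return unique_keys, firsts, modes, uniques

keys = [1, 1, 1, 2, 2]
vals = [5, 7, 7, 3, 3]
uk, f, m, u = group_first_mode_unique(keys, vals)
# f -> [5, 3]; m -> [7, 3]; u -> [[5, 7], [3]]
```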
- broadcast(values: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings, permute: bool = True) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings#
Fill each group’s segment with a constant value.
- Parameters:
- Returns:
The broadcasted values
- Return type:
- Raises:
TypeError – Raised if value is not a pdarray object
ValueError – Raised if the values array does not have one value per segment
Notes
This function is a sparse analog of np.broadcast. If a GroupBy object represents a sparse matrix (tensor), then this function takes a (dense) column vector and replicates each value to the non-zero elements in the corresponding row.
Examples
>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
# By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
# With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5])
>>> a = ak.randint(1,5,10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
- static build_from_components(user_defined_name: str = None, **kwargs) GroupBy#
Function to build a new GroupBy object from component keys and permutation.
- Parameters:
user_defined_name (str, optional) – Passing a name will initialize the new GroupBy and assign it the given name
kwargs (dict) – Dictionary of components required for rebuilding the GroupBy. Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”
- Returns:
The GroupBy object created by using the given components
- Return type:
- register(user_defined_name: str) GroupBy#
Register this GroupBy object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the GroupBy is to be registered under, this will be the root name for underlying components
- Returns:
The same GroupBy which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different GroupBys with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the GroupBy with the user_defined_name
See also
unregister, attach, unregister_groupby_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister()#
Unregister this GroupBy object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() bool#
Return True if the object is contained in the registry
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mismatch of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- static attach(user_defined_name: str) GroupBy#
Function to return a GroupBy object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which GroupBy object was registered under
- Returns:
The GroupBy object created by re-attaching to the corresponding server components
- Return type:
- Raises:
RegistrationError – if user_defined_name is not registered
See also
register,is_registered,unregister,unregister_groupby_by_name
- static unregister_groupby_by_name(user_defined_name: str) None#
Function to unregister GroupBy object by name which was registered with the arkouda server via register()
- Parameters:
user_defined_name (str) – Name under which the GroupBy object was registered
- Raises:
TypeError – if user_defined_name is not a string
RegistrationError – if there is an issue attempting to unregister any underlying components
See also
- most_common(values)#
(Deprecated) See GroupBy.mode().
- arkouda.broadcast(segments: arkouda.pdarrayclass.pdarray, values: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings, size: int | numpy.int64 | numpy.uint64 = -1, permutation: arkouda.pdarrayclass.pdarray | None = None)#
Broadcast a dense column vector to the rows of a sparse matrix or grouped array.
- Parameters:
segments (pdarray, int64) – Offsets of the start of each row in the sparse matrix or grouped array. Must be sorted in ascending order.
values (pdarray, Strings) – The values to broadcast, one per row (or group)
size (int) – The total number of nonzeros in the matrix. If permutation is given, this argument is ignored and the size is inferred from the permutation array.
permutation (pdarray, int64) – The permutation to go from the original ordering of nonzeros to the ordering grouped by row. To broadcast values back to the original ordering, this permutation will be inverted. If no permutation is supplied, it is assumed that the original nonzeros were already grouped by row. In this case, the size argument must be given.
- Returns:
The broadcast values, one per nonzero
- Return type:
- Raises:
ValueError –
If segments and values are different sizes
If segments are empty
If number of nonzeros (either user-specified or inferred from permutation) is less than one
Examples
>>> # Define a sparse matrix with 3 rows and 7 nonzeros
>>> row_starts = ak.array([0, 2, 5])
>>> nnz = 7
>>> # Broadcast the row number to each nonzero element
>>> row_number = ak.arange(3)
>>> ak.broadcast(row_starts, row_number, nnz)
array([0 0 1 1 1 2 2])
>>> # If the original nonzeros were in reverse order...
>>> permutation = ak.arange(6, -1, -1)
>>> ak.broadcast(row_starts, row_number, permutation=permutation)
array([2 2 1 1 1 0 0])
- arkouda.gen_ranges(starts, ends, stride=1)#
Generate a segmented array of variable-length, contiguous ranges between pairs of start- and end-points.
- Parameters:
- Returns:
segments (pdarray, int64) – The starting index of each range in the resulting array
ranges (pdarray, int64) – The actual ranges, flattened into a single array
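The gen_ranges output can be sketched with NumPy: flatten the per-pair ranges and record where each one starts in the flattened result. An illustrative client-side sketch of the same semantics:

```python
import numpy as np

def gen_ranges_sketch(starts, ends, stride=1):
    """Sketch of gen_ranges: concatenate the [start, end) ranges and
    record the offset of each range within the flattened array."""
    ranges = [np.arange(s, e, stride) for s, e in zip(starts, ends)]
    lengths = np.array([len(r) for r in ranges])
    # Each segment starts where the previous ranges end
    segments = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    return segments, np.concatenate(ranges)

segs, flat = gen_ranges_sketch([0, 10, 20], [3, 12, 24])
# segs -> [0 3 5]; flat -> [0 1 2 10 11 20 21 22 23]
```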
- arkouda.getArkoudaLogger(name: str, handlers: List[logging.Handler] | None = None, logFormat: str | None = ArkoudaLogger.DEFAULT_LOG_FORMAT, logLevel: LogLevel = None) ArkoudaLogger#
A convenience method for instantiating an ArkoudaLogger that retrieves the logging level from the ARKOUDA_LOG_LEVEL env variable
- Parameters:
name (str) – The name of the ArkoudaLogger
handlers (List[Handler]) – A list of logging.Handler objects, if None, a list consisting of one StreamHandler named ‘console-handler’ is generated and configured
logFormat (str) – The format for log messages, defaults to the following format: ‘[%(name)s] Line %(lineno)d %(levelname)s: %(message)s’
- Return type:
ArkoudaLogger
- Raises:
TypeError – Raised if either name or logFormat is not a str object or if handlers is not a list of logging.Handler objects
Notes
Important note: if a list of 1..n logging.Handler objects is passed in, and dynamic changes to those handlers are desired, set a name for each Handler object as follows: handler.name = <desired name>, which will enable retrieval and updates for the specified handler.
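The note above relies on standard logging.Handler conventions, so it can be illustrated with the stdlib alone: name a handler, then look it up later by name to reconfigure it. This is a stdlib sketch, not ArkoudaLogger itself.

```python
import logging

# Create a named handler so it can be retrieved and updated later
handler = logging.StreamHandler()
handler.name = 'console-handler'
handler.setFormatter(logging.Formatter(
    '[%(name)s] Line %(lineno)d %(levelname)s: %(message)s'))

logger = logging.getLogger('example')
logger.addHandler(handler)

# Later: retrieve the handler by name and change its level dynamically
target = next(h for h in logger.handlers if h.name == 'console-handler')
target.setLevel(logging.DEBUG)
```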
- arkouda.cumsum(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Return the cumulative sum over the array.
The sum is inclusive, such that the ith element of the result is the sum of elements up to and including i.
- Parameters:
pda (pdarray) –
- Returns:
A pdarray containing cumulative sums for each element of the original pdarray
- Return type:
- Raises:
TypeError – Raised if the parameter is not a pdarray
Examples
>>> ak.cumsum(ak.arange(1,5))
array([1, 3, 6, 10])
>>> ak.cumsum(ak.uniform(5,1.0,5.0))
array([3.1598310770203937, 5.4110385860243131, 9.1622479306453748, 12.710615785506533, 13.945880905466208])
>>> ak.cumsum(ak.randint(0, 1, 5, dtype=ak.bool))
array([0, 1, 1, 2, 3])
- exception arkouda.RegistrationError#
Bases: Exception
Error/Exception used when the Arkouda Server cannot register an object
- arkouda.create_pdarray(repMsg: str, max_bits=None) pdarray#
Return a pdarray instance pointing to an array created by the arkouda server. The user should not call this function directly.
- Parameters:
repMsg (str) – space-delimited string containing the pdarray name, datatype, size, dimension, shape, and itemsize
- Returns:
A pdarray instance pointing to the server-side array with the given attributes and data
- Return type:
- Raises:
ValueError – If there’s an error in parsing the repMsg parameter into the six values needed to create the pdarray instance
RuntimeError – Raised if a server-side error is thrown in the process of creating the pdarray instance
- arkouda.is_sorted(pda: pdarray) numpy.bool_#
Return True iff the array is monotonically non-decreasing.
- Parameters:
pda (pdarray) – The pdarray instance to be evaluated
- Returns:
Indicates if the array is monotonically non-decreasing
- Return type:
bool
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- class arkouda.pdarray(name: str, mydtype: numpy.dtype | str, size: arkouda.dtypes.int_scalars, ndim: arkouda.dtypes.int_scalars, shape: Sequence[int], itemsize: arkouda.dtypes.int_scalars, max_bits: int | None = None)#
The basic arkouda array class. This class contains only the attributes of the array; the data resides on the arkouda server. When a server operation results in a new array, arkouda will create a pdarray instance that points to the array data on the server. As such, the user should not initialize pdarray instances directly.
- name#
The server-side identifier for the array
- Type:
str
- dtype#
The element type of the array
- Type:
dtype
- size#
The number of elements in the array
- Type:
int_scalars
- ndim#
The rank of the array (currently only rank 1 arrays supported)
- Type:
int_scalars
- shape#
A list or tuple containing the sizes of each dimension of the array
- Type:
Sequence[int]
- itemsize#
The size in bytes of each element
- Type:
int_scalars
- property max_bits#
- BinOps#
- OpEqOps#
- objType = 'pdarray'#
- format_other(other) str#
Attempt to cast scalar other to the element dtype of this pdarray, and print the resulting value to a string (e.g. for sending to a server command). The user should not call this function directly.
- Parameters:
other (object) – The scalar to be cast to the pdarray.dtype
- Return type:
string representation of np.dtype corresponding to the other parameter
- Raises:
TypeError – Raised if the other parameter cannot be converted to Numpy dtype
- transfer(hostname: str, port: arkouda.dtypes.int_scalars)#
Sends a pdarray to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the pdarray is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports in succession, using ports in the range {port..(port+numLocales)} (e.g., for an Arkouda server of 4 nodes with port 1234 passed, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- opeq(other, op)#
- fill(value: arkouda.dtypes.numeric_scalars) None#
Fill the array (in place) with a constant value.
- Parameters:
value (numeric_scalars) –
- Raises:
TypeError – Raised if value is not an int, int64, float, or float64
- any() numpy.bool_#
Return True iff any element of the array evaluates to True.
- all() numpy.bool_#
Return True iff all elements of the array evaluate to True.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry
- Parameters:
None –
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
Note
This will return True if the object is registered itself or as a component of another object
- info() str#
Returns a JSON formatted string containing information about all components of self
- Parameters:
None –
- Returns:
JSON string containing information about all components of self
- Return type:
str
- pretty_print_info() None#
Prints information about all components of self in a human readable format
- Parameters:
None –
- Return type:
None
- is_sorted() numpy.bool_#
Return True iff the array is monotonically non-decreasing.
- Parameters:
None –
- Returns:
Indicates if the array is monotonically non-decreasing
- Return type:
bool
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- sum() arkouda.dtypes.numeric_and_bool_scalars#
Return the sum of all elements in the array.
- prod() numpy.float64#
Return the product of all elements in the array. Return value is always a np.float64 or np.int64.
- min() arkouda.dtypes.numpy_scalars#
Return the minimum value of the array.
- max() arkouda.dtypes.numpy_scalars#
Return the maximum value of the array.
- argmin() numpy.int64 | numpy.uint64#
Return the index of the first occurrence of the array min value
- argmax() numpy.int64 | numpy.uint64#
Return the index of the first occurrence of the array max value.
- mean() numpy.float64#
Return the mean of the array.
- var(ddof: arkouda.dtypes.int_scalars = 0) numpy.float64#
Compute the variance. See arkouda.var for details.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
- Returns:
The scalar variance of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
ValueError – Raised if the ddof >= pdarray size
RuntimeError – Raised if there’s a server-side error thrown
- std(ddof: arkouda.dtypes.int_scalars = 0) numpy.float64#
Compute the standard deviation. See arkouda.std for details.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
The scalar standard deviation of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- cov(y: pdarray) numpy.float64#
Compute the covariance between self and y.
- Parameters:
y (pdarray) – Other pdarray used to calculate covariance
- Returns:
The scalar covariance of the two arrays
- Return type:
np.float64
- Raises:
TypeError – Raised if y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- corr(y: pdarray) numpy.float64#
Compute the correlation between self and y using pearson correlation coefficient.
- Parameters:
y (pdarray) – Other pdarray used to calculate correlation
- Returns:
The scalar correlation of the two arrays
- Return type:
np.float64
- Raises:
TypeError – Raised if y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
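The cov and corr methods compute the usual sample covariance and Pearson correlation coefficient. A NumPy sketch of those definitions (client-side illustration only; exact ddof handling in Arkouda may differ):

```python
import numpy as np

def cov_corr(x, y, ddof=1):
    """Sample covariance and Pearson correlation of two equal-length arrays."""
    xm, ym = x - x.mean(), y - y.mean()
    cov = (xm * ym).sum() / (len(x) - ddof)
    corr = (xm * ym).sum() / np.sqrt((xm ** 2).sum() * (ym ** 2).sum())
    return cov, corr

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
cov, corr = cov_corr(x, y)
# y = 2x exactly, so corr -> 1.0
```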
- mink(k: arkouda.dtypes.int_scalars) pdarray#
Compute the minimum “k” values.
- Parameters:
k (int_scalars) – The desired count of minimum values to be returned by the output.
- Returns:
The minimum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- maxk(k: arkouda.dtypes.int_scalars) pdarray#
Compute the maximum “k” values.
- Parameters:
k (int_scalars) – The desired count of maximum values to be returned by the output.
- Returns:
The maximum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- argmink(k: arkouda.dtypes.int_scalars) pdarray#
Compute the minimum “k” values.
- Parameters:
k (int_scalars) – The desired count of minimum values to be returned by the output.
- Returns:
Indices corresponding to the minimum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- argmaxk(k: arkouda.dtypes.int_scalars) pdarray#
Finds the indices corresponding to the maximum “k” values.
- Parameters:
k (int_scalars) – The desired count of maximum values to be returned by the output.
- Returns:
Indices corresponding to the maximum k values, sorted
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
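The four "k-extreme" methods (mink, maxk, argmink, argmaxk) can be related through a single stable sort, as in this NumPy sketch of their semantics (the server uses a more efficient selection, not a full sort):

```python
import numpy as np

def k_extremes(a, k):
    """The k smallest/largest values of a, and the indices selecting them."""
    order = np.argsort(a, kind='stable')
    argmink, argmaxk = order[:k], order[-k:]
    return a[argmink], a[argmaxk], argmink, argmaxk

a = np.array([30, 10, 50, 20, 40])
mink, maxk, amin, amax = k_extremes(a, 2)
# mink -> [10 20]; maxk -> [40 50]; amin -> [1 3]; amax -> [4 2]
```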
- value_counts()#
Count the occurrences of the unique values of self.
- Returns:
unique_values (pdarray) – The unique values, sorted in ascending order
counts (pdarray, int64) – The number of times the corresponding unique value occurs
Examples
>>> ak.array([2, 0, 2, 4, 0, 0]).value_counts()
(array([0, 2, 4]), array([3, 2, 1]))
- astype(dtype) pdarray#
Cast values of pdarray to provided dtype
- Parameters:
dtype (np.dtype or str) – Dtype to cast to
- Returns:
An arkouda pdarray with values converted to the specified data type
- Return type:
ak.pdarray
Notes
This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.
- slice_bits(low, high) pdarray#
Returns a pdarray containing only bits from low to high of self.
This is zero indexed and inclusive on both ends, so slicing the bottom 64 bits is pda.slice_bits(0, 63)
- Parameters:
low (int) – The lowest bit included in the slice (inclusive) zero indexed, so the first bit is 0
high (int) – The highest bit included in the slice (inclusive)
- Returns:
A new pdarray containing the bits of self from low to high
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> p = ak.array([2**65 + (2**64 - 1)])
>>> bin(p[0])
'0b101111111111111111111111111111111111111111111111111111111111111111'
>>> bin(p.slice_bits(64, 65)[0])
'0b10'
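The slice_bits operation amounts to a shift and a mask. A pure-Python sketch of the same semantics, using the example value above:

```python
def slice_bits_sketch(x, low, high):
    """Keep bits low..high (inclusive, zero-indexed) of x,
    shifted down so the slice starts at bit 0."""
    width = high - low + 1
    return (x >> low) & ((1 << width) - 1)

p = 2**65 + (2**64 - 1)
# Bits 64..65 of p are '10'
high_bits = slice_bits_sketch(p, 64, 65)   # 0b10
low_bits = slice_bits_sketch(p, 0, 63)     # 2**64 - 1 (all ones)
```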
- bigint_to_uint_arrays() List[pdarray]#
Creates a list of uint pdarrays from a bigint pdarray. The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.
- Returns:
A list of uint pdarrays where: The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.
- Return type:
List[pdarrays]
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> a = ak.arange(2**64, 2**64 + 5)
>>> a
array(["18446744073709551616" "18446744073709551617" "18446744073709551618" "18446744073709551619" "18446744073709551620"])
>>> a.bigint_to_uint_arrays()
[array([1 1 1 1 1]), array([0 1 2 3 4])]
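The limb decomposition behind bigint_to_uint_arrays can be sketched for a single Python integer: repeatedly mask off the low 64 bits, then reverse so the most significant limb comes first. A client-side illustration only:

```python
def to_uint64_limbs(x):
    """Split a non-negative int into 64-bit limbs, most significant first."""
    limbs = []
    while True:
        limbs.append(x & (2**64 - 1))  # low 64 bits
        x >>= 64
        if x == 0:
            break
    return limbs[::-1]

# 2**64 + 3 splits into high limb 1 and low limb 3
limbs = to_uint64_limbs(2**64 + 3)
# limbs -> [1, 3]
```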
- reshape(*shape, order='row_major')#
Gives a new shape to an array without changing its data.
- Parameters:
shape (int, tuple of ints, or pdarray) – The new shape should be compatible with the original shape.
order (str {'row_major' | 'C' | 'column_major' | 'F'}) – Read the elements of the pdarray in this index order. By default, read the elements in row_major or C-like order, where the last index changes the fastest. If ‘column_major’ or ‘F’, read the elements in column_major or Fortran-like order, where the first index changes the fastest.
- Returns:
An arrayview object with the data from the array but with the new shape
- Return type:
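The two read orders correspond directly to NumPy's 'C' and 'F' order flags, which makes the difference easy to see client-side (a NumPy illustration of the convention, not Arkouda's implementation):

```python
import numpy as np

a = np.arange(6)
row_major = a.reshape(2, 3, order='C')  # last index changes fastest
col_major = a.reshape(2, 3, order='F')  # first index changes fastest
# row_major -> [[0 1 2], [3 4 5]]
# col_major -> [[0 2 4], [1 3 5]]
```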
- to_ndarray() numpy.ndarray#
Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same attributes and data as the pdarray
- Return type:
np.ndarray
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
numpy.ndarray
- to_list() List#
Convert the array to a list, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A list with the same data as the pdarray
- Return type:
list
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_list()
[0, 1, 2, 3, 4]
>>> type(a.to_list())
list
- to_cuda()#
Convert the array to a Numba DeviceND array, transferring array data from the arkouda server to Python via ndarray. If the array exceeds a builtin size limit, a RuntimeError is raised.
- Returns:
A Numba ndarray with the same attributes and data as the pdarray; on GPU
- Return type:
numba.DeviceNDArray
- Raises:
ImportError – Raised if CUDA is not available
ModuleNotFoundError – Raised if Numba is either not installed or not enabled
RuntimeError – Raised if there is a server-side error thrown in the course of retrieving the pdarray.
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_cuda()
array([0, 1, 2, 3, 4])
>>> type(a.to_cuda())
numpy.devicendarray
- to_parquet(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None) str#
Save the pdarray to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'.
‘append’ write mode is supported, but is not efficient.
If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (parquet)
>>> a.to_parquet('path/prefix.parquet', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
- to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute') str#
Save the pdarray to HDF5. The object can be saved to a collection of files or single file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
- Parameters:
file_type (str ("single" | "distribute")) – Default: “distribute”. When set to single, the dataset is written to a single file. When distribute, the dataset is written to a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission. - Output files have names of the form
<prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25) >>> # Saving without an extension >>> a.to_hdf('path/prefix', dataset='array') Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####`` >>> # Saving with an extension (HDF5) >>> a.to_hdf('path/prefix.h5', dataset='array') Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number >>> # Saving to a single file >>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single') Saves the array into a single HDF5 file on the root node: ``cwd/path/name_prefix.hdf5``
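The locale-based naming rule in the notes above (insert _LOCALE<i> before any file extension for distributed writes, use prefix_path unchanged for single-file writes) can be sketched in plain Python. This is an illustrative reconstruction, not arkouda's actual code; in particular, whether the locale number is zero-padded is left out here as an open assumption.

```python
def output_filenames(prefix_path, num_locales, file_type="distribute"):
    """Sketch of the output-file naming rule described in the notes:
    any extension on prefix_path is kept, and _LOCALE<i> is inserted
    before it for distributed writes."""
    if file_type == "single":
        return [prefix_path]
    # Split off an optional extension so it lands after the locale tag.
    filename = prefix_path.rsplit("/", 1)[-1]
    if "." in filename:
        base, ext = prefix_path.rsplit(".", 1)
        ext = "." + ext
    else:
        base, ext = prefix_path, ""
    return [f"{base}_LOCALE{i}{ext}" for i in range(num_locales)]

print(output_filenames("path/prefix.h5", 2))
# ['path/prefix_LOCALE0.h5', 'path/prefix_LOCALE1.h5']
print(output_filenames("path/prefix.hdf5", 4, file_type="single"))
# ['path/prefix.hdf5']
```

Note how the extension survives in every per-locale filename, which is why the file format is detected from the file contents rather than the extension.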
- update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)#
Overwrite the dataset with the name provided with this pdarray. If the dataset does not exist, it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
If the file does not contain a File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
- to_csv(prefix_path: str, dataset: str = 'array', col_delim: str = ',', overwrite: bool = False)#
Write pdarray to CSV file(s). File will contain a single column with the pdarray data. All CSV Files written by Arkouda include a header denoting data types of the columns.
- prefix_path: str
The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
- dataset: str
Column name to save the pdarray under. Defaults to “array”.
- col_delim: str
Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
- overwrite: bool
Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Returns:
str response message
- Raises:
- ValueError
Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
- RuntimeError
Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
- TypeError
Raised if we receive an unknown arkouda_type returned from the server
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (`\n`) at this time.
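For intuition, a single-column CSV of the shape described above can be sketched with the standard csv module. The `**TYPE:...**` header line below is a hypothetical stand-in: the docs only state that a header denoting column dtypes is written, not its exact format.

```python
import csv
import io

def write_single_column_csv(column_name, values, dtype_name, col_delim=","):
    """Illustrative sketch of a single-column CSV with a dtype header.
    The header format here is an assumption; arkouda's real header
    layout may differ."""
    buf = io.StringIO()
    # Hypothetical dtype annotation line, before the column name row.
    buf.write(f"**TYPE:{dtype_name}**\n")
    writer = csv.writer(buf, delimiter=col_delim, lineterminator="\n")
    writer.writerow([column_name])  # column name row
    for v in values:                # one value per row
        writer.writerow([v])
    return buf.getvalue()

text = write_single_column_csv("array", [1, 2, 3], "int64")
print(text)
```

Because the same delimiter separates both column names and data, a delimiter that appears inside the data would corrupt the file, which is why the warning above matters.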
- save(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str#
DEPRECATED Save the pdarray to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 supports single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
- Parameters:
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.
file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append
TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string
See also
save_all, load, read, to_parquet, to_hdf
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission. Output files have names of the form
<prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. Previously, all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found. Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25) >>> # Saving without an extension >>> a.save('path/prefix', dataset='array') Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####`` >>> # Saving with an extension (HDF5) >>> a.save('path/prefix.h5', dataset='array') Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number >>> # Saving with an extension (Parquet) >>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet') Saves the array in numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
- register(user_defined_name: str) pdarray#
Register this pdarray with a user defined name in the arkouda server so it can be attached to later using pdarray.attach(). This is an in-place operation; registering a pdarray more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one pdarray at a time.
- Parameters:
user_defined_name (str) – user defined name array is to be registered under
- Returns:
The same pdarray which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different pdarrays with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the pdarray with the user_defined_name If the user is attempting to register more than one pdarray with the same name, the former should be unregistered first to free up the registration name.
See also
attach, unregister, is_registered, list_registry, unregister_pdarray_by_name
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100) >>> a.register("my_zeros") >>> # potentially disconnect from server and reconnect to server >>> b = ak.pdarray.attach("my_zeros") >>> # ...other work... >>> b.unregister()
- unregister() None#
Unregister a pdarray in the arkouda server which was previously registered using register() and/or attached to using attach().
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not find the internal name/symbol to remove
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100) >>> a.register("my_zeros") >>> # potentially disconnect from server and reconnect to server >>> b = ak.pdarray.attach("my_zeros") >>> # ...other work... >>> b.unregister()
- static attach(user_defined_name: str) pdarray#
Class method to return a pdarray attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which array was registered under
- Returns:
pdarray which is bound to the corresponding server side component which was registered with user_defined_name
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100) >>> a.register("my_zeros") >>> # potentially disconnect from server and reconnect to server >>> b = ak.pdarray.attach("my_zeros") >>> # ...other work... >>> b.unregister()
- arkouda.arange(*args, **kwargs) arkouda.pdarrayclass.pdarray#
arange([start,] stop[, stride,] dtype=int64)
Create a pdarray of consecutive integers within the interval [start, stop). If only one arg is given then arg is the stop parameter. If two args are given, then the first arg is start and second is stop. If three args are given, then the first arg is start, second is stop, third is stride.
The return value is cast to type dtype
- Parameters:
start (int_scalars, optional) – Starting value (inclusive)
stop (int_scalars) – Stopping value (exclusive)
stride (int_scalars, optional) – The difference between consecutive elements; the default stride is 1. If stride is specified, then start must also be specified.
dtype (np.dtype, type, or str) – The target dtype to cast values to
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
Integers from start (inclusive) to stop (exclusive) by stride
- Return type:
pdarray, dtype
- Raises:
TypeError – Raised if start, stop, or stride is not an int object
ZeroDivisionError – Raised if stride == 0
Notes
Negative strides result in decreasing values. Currently, only int64 pdarrays can be created with this method. For float64 arrays, use the linspace method.
Examples
>>> ak.arange(0, 5, 1) array([0, 1, 2, 3, 4])
>>> ak.arange(5, 0, -1) array([5, 4, 3, 2, 1])
>>> ak.arange(0, 10, 2) array([0, 2, 4, 6, 8])
>>> ak.arange(-5, -10, -1) array([-5, -6, -7, -8, -9])
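The positional dispatch described above (one argument is stop, two are start and stop, three add stride) can be sketched client-side. This is an illustrative reconstruction, not arkouda's actual argument handling, which also deals with dtype and max_bits.

```python
def parse_arange_args(*args):
    """Mirror the (start, stop, stride) dispatch: one arg -> stop,
    two -> (start, stop), three -> (start, stop, stride)."""
    if len(args) == 1:
        start, stop, stride = 0, args[0], 1
    elif len(args) == 2:
        start, stop, stride = args[0], args[1], 1
    elif len(args) == 3:
        start, stop, stride = args
    else:
        raise TypeError("arange expects 1 to 3 positional arguments")
    if stride == 0:
        raise ZeroDivisionError("stride cannot be 0")
    return start, stop, stride

print(parse_arange_args(5))        # (0, 5, 1)
print(parse_arange_args(5, 0, -1)) # (5, 0, -1)
```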
- arkouda.array(a: arkouda.pdarrayclass.pdarray | numpy.ndarray | Iterable, dtype: numpy.dtype | type | str = None, max_bits: int = -1) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings#
Convert a Python or Numpy Iterable to a pdarray or Strings object, sending the corresponding data to the arkouda server.
- Parameters:
a (Union[pdarray, np.ndarray]) – Rank-1 array of a supported dtype
dtype (np.dtype, type, or str) – The target dtype to cast values to
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
A pdarray instance stored on arkouda server or Strings instance, which is composed of two pdarrays stored on arkouda server
- Return type:
- Raises:
TypeError – Raised if a is not a pdarray, np.ndarray, or Python Iterable such as a list, array, tuple, or deque
RuntimeError – Raised if a is not one-dimensional, nbytes > maxTransferBytes, a.dtype is not supported (not in DTypes), or if the product of a size and a.itemsize > maxTransferBytes
ValueError – Raised if the returned message is malformed or does not contain the fields required to generate the array.
See also
Notes
The number of bytes in the input array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overwhelming the connection between the Python client and the arkouda server, under the assumption that it is a low-bandwidth connection. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but should proceed with caution.
If the pdarray or ndarray is of type U, this method is called twice recursively to create the Strings object and the two corresponding pdarrays for string bytes and offsets, respectively.
Examples
>>> ak.array(np.arange(1,10)) array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> ak.array(range(1,10)) array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> strings = ak.array([f'string {i}' for i in range(0,5)]) >>> type(strings) <class 'arkouda.strings.Strings'>
- arkouda.ones(size: arkouda.dtypes.int_scalars | str, dtype: numpy.dtype | type | str | arkouda.dtypes.BigInt = float64, max_bits: int | None = None) arkouda.pdarrayclass.pdarray#
Create a pdarray filled with ones.
- Parameters:
size (int_scalars) – Size of the array (only rank-1 arrays supported)
dtype (Union[float64, int64, bool]) – Resulting array type, default float64
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
Ones of the requested size and dtype
- Return type:
- Raises:
TypeError – Raised if the supplied dtype is not supported or if the size parameter is neither an int nor a str that is parseable to an int.
Examples
>>> ak.ones(5, dtype=ak.int64) array([1, 1, 1, 1, 1])
>>> ak.ones(5, dtype=ak.float64) array([1, 1, 1, 1, 1])
>>> ak.ones(5, dtype=ak.bool) array([True, True, True, True, True])
- arkouda.zeros(size: arkouda.dtypes.int_scalars | str, dtype: numpy.dtype | type | str | arkouda.dtypes.BigInt = float64, max_bits: int | None = None) arkouda.pdarrayclass.pdarray#
Create a pdarray filled with zeros.
- Parameters:
size (int_scalars) – Size of the array (only rank-1 arrays supported)
dtype (all_scalars) – Type of resulting array, default float64
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
Zeros of the requested size and dtype
- Return type:
- Raises:
TypeError – Raised if the supplied dtype is not supported or if the size parameter is neither an int nor a str that is parseable to an int.
See also
Examples
>>> ak.zeros(5, dtype=ak.int64) array([0, 0, 0, 0, 0])
>>> ak.zeros(5, dtype=ak.float64) array([0, 0, 0, 0, 0])
>>> ak.zeros(5, dtype=ak.bool) array([False, False, False, False, False])
- arkouda.concatenate(arrays: Sequence[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical], ordered: bool = True) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical#
Concatenate a list or tuple of pdarray or Strings objects into one pdarray or Strings object, respectively.
- Parameters:
arrays (Sequence[Union[pdarray,Strings,Categorical]]) – The arrays to concatenate. Must all have same dtype.
ordered (bool) – If True (default), the arrays will be appended in the order given. If False, array data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.
- Returns:
Single pdarray or Strings object containing all values, returned in the original order
- Return type:
Union[pdarray,Strings,Categorical]
- Raises:
ValueError – Raised if arrays is empty or if 1..n pdarrays have differing dtypes
TypeError – Raised if arrays is not a Python Sequence (such as a list or tuple) of pdarray or Strings objects
RuntimeError – Raised if 1..n array elements are dtypes for which concatenate has not been implemented.
Examples
>>> ak.concatenate([ak.array([1, 2, 3]), ak.array([4, 5, 6])]) array([1, 2, 3, 4, 5, 6])
>>> ak.concatenate([ak.array([True,False,True]),ak.array([False,True,True])]) array([True, False, True, False, True, True])
>>> ak.concatenate([ak.array(['one','two']),ak.array(['three','four','five'])]) array(['one', 'two', 'three', 'four', 'five'])
- class arkouda.Strings(strings_pdarray: arkouda.pdarrayclass.pdarray, bytes_size: arkouda.dtypes.int_scalars)#
Represents an array of strings whose data resides on the arkouda server. The user should not call this class directly; rather its instances are created by other arkouda functions.
- entry#
Encapsulation of a Segmented Strings array contained on the arkouda server. This is a composite of
offsets array: starting indices for each string
bytes array: raw bytes of all strings joined by nulls
- Type:
- size#
The number of strings in the array
- Type:
int_scalars
- nbytes#
The total number of bytes in all strings
- Type:
int_scalars
- ndim#
The rank of the array (currently only rank 1 arrays supported)
- Type:
int_scalars
- shape#
The sizes of each dimension of the array
- Type:
tuple
- dtype#
The dtype is ak.str
- Type:
dtype
- logger#
Used for all logging operations
- Type:
ArkoudaLogger
Notes
Strings is composed of two pdarrays: (1) offsets, which contains the starting indices for each string and (2) bytes, which contains the raw bytes of all strings, delimited by nulls.
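The two-pdarray layout in this note can be reproduced with plain Python. The sketch below yields the same values that the get_bytes() and get_offsets() examples further down return for ['one', 'two', 'three'].

```python
def segmented_layout(strings):
    """Build the (offsets, bytes) representation described above:
    offsets[i] is the starting index of string i in the flat byte
    buffer, and each string is terminated by a null byte."""
    offsets, buf = [], bytearray()
    for s in strings:
        offsets.append(len(buf))
        buf += s.encode("utf-8") + b"\x00"
    return offsets, bytes(buf)

offsets, raw = segmented_layout(["one", "two", "three"])
print(offsets)    # [0, 4, 8]
print(list(raw))  # [111, 110, 101, 0, 116, 119, 111, 0, 116, 104, 114, 101, 101, 0]
```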
- BinOps#
- objType = 'Strings'#
- static from_return_msg(rep_msg: str) Strings#
Factory method for creating a Strings object from an Arkouda server response message
- Parameters:
rep_msg (str) – Server response message currently of form created name type size ndim shape itemsize+created bytes.size 1234
- Returns:
object representing a segmented strings array on the server
- Return type:
- Raises:
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
We really don’t have an itemsize because these are variable length strings. In the future we could probably use this position to store the total bytes.
- static from_parts(offset_attrib: arkouda.pdarrayclass.pdarray | str, bytes_attrib: arkouda.pdarrayclass.pdarray | str) Strings#
Factory method for creating a Strings object from an Arkouda server response where the arrays are separate components.
- Parameters:
- Returns:
object representing a segmented strings array on the server
- Return type:
- Raises:
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
This factory method is used when we construct the parts of a Strings object on the client side and transfer the offsets & bytes separately to the server. This results in two entries in the symbol table, and we need to instruct the server to assemble them into a composite entity.
- get_lengths() arkouda.pdarrayclass.pdarray#
Return the length of each string in the array.
- Returns:
The length of each string
- Return type:
pdarray, int
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- get_bytes()#
Getter for the bytes component (uint8 pdarray) of this Strings.
- Returns:
Pdarray of bytes of the string accessed
- Return type:
pdarray, uint8
Example
>>> x = ak.array(['one', 'two', 'three']) >>> x.get_bytes() [111 110 101 0 116 119 111 0 116 104 114 101 101 0]
- get_offsets()#
Getter for the offsets component (int64 pdarray) of this Strings.
- Returns:
Pdarray of offsets of the string accessed
- Return type:
pdarray, int64
Example
>>> x = ak.array(['one', 'two', 'three']) >>> x.get_offsets() [0 4 8]
- encode(toEncoding: str, fromEncoding: str = 'UTF-8')#
Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding
- Parameters:
toEncoding (str) – The encoding that the strings will be converted to
fromEncoding (str) – The current encoding of the strings object, default to UTF-8
- Returns:
A new Strings object in toEncoding
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- decode(fromEncoding, toEncoding='UTF-8')#
Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding
- Parameters:
fromEncoding (str) – The current encoding of the strings object
toEncoding (str) – The encoding that the strings will be converted to, default to UTF-8
- Returns:
A new Strings object in toEncoding
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- to_lower() Strings#
Returns a new Strings with all uppercase characters from the original replaced with their lowercase equivalent
- Returns:
Strings with all uppercase characters from the original replaced with their lowercase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)]) >>> strings array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4']) >>> strings.to_lower() array(['strings 0', 'strings 1', 'strings 2', 'strings 3', 'strings 4'])
- to_upper() Strings#
Returns a new Strings with all lowercase characters from the original replaced with their uppercase equivalent
- Returns:
Strings with all lowercase characters from the original replaced with their uppercase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)]) >>> strings array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4']) >>> strings.to_upper() array(['STRINGS 0', 'STRINGS 1', 'STRINGS 2', 'STRINGS 3', 'STRINGS 4'])
- to_title() Strings#
Returns a new Strings with all strings from the original converted to their titlecase equivalent
- Returns:
Strings with all strings from the original converted to their titlecase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Strings.to_lower, Strings.to_upper
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)]) >>> strings array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4']) >>> strings.to_title() array(['Strings 0', 'Strings 1', 'Strings 2', 'Strings 3', 'Strings 4'])
- is_lower() arkouda.pdarrayclass.pdarray#
Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely lowercase
- Returns:
True for elements that are entirely lowercase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> lower = ak.array([f'strings {i}' for i in range(3)]) >>> upper = ak.array([f'STRINGS {i}' for i in range(3)]) >>> strings = ak.concatenate([lower, upper]) >>> strings array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2']) >>> strings.is_lower() array([True True True False False False])
- is_upper() arkouda.pdarrayclass.pdarray#
Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely uppercase
- Returns:
True for elements that are entirely uppercase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> lower = ak.array([f'strings {i}' for i in range(3)]) >>> upper = ak.array([f'STRINGS {i}' for i in range(3)]) >>> strings = ak.concatenate([lower, upper]) >>> strings array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2']) >>> strings.is_upper() array([False False False True True True])
- is_title() arkouda.pdarrayclass.pdarray#
Returns a boolean pdarray where index i indicates whether string i of the Strings is titlecase
- Returns:
True for elements that are titlecase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> mixed = ak.array([f'sTrINgs {i}' for i in range(3)]) >>> title = ak.array([f'Strings {i}' for i in range(3)]) >>> strings = ak.concatenate([mixed, title]) >>> strings array(['sTrINgs 0', 'sTrINgs 1', 'sTrINgs 2', 'Strings 0', 'Strings 1', 'Strings 2']) >>> strings.is_title() array([False False False True True True])
- strip(chars: bytes | arkouda.dtypes.str_scalars | None = '') Strings#
Returns a new Strings object with all leading and trailing occurrences of characters contained in chars removed. The chars argument is a string specifying the set of characters to be removed. If omitted, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.
- Parameters:
chars – the set of characters to be removed
- Returns:
Strings object with the leading and trailing characters matching the set of characters in the chars argument removed
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['Strings ', ' StringS ', 'StringS ']) >>> s = strings.strip() >>> s array(['Strings', 'StringS', 'StringS'])
>>> strings = ak.array(['Strings 1', '1 StringS ', ' 1StringS 12 ']) >>> s = strings.strip(' 12') >>> s array(['Strings', 'StringS', 'StringS'])
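Because chars is treated as a set of characters rather than a prefix or suffix, strip behaves element-wise like Python's str.strip; the second example above can be checked locally:

```python
# The chars argument is a set of characters, not a literal prefix or
# suffix -- the same semantics as Python's str.strip, applied per element.
elements = ['Strings 1', '1 StringS ', ' 1StringS 12 ']
stripped = [s.strip(' 12') for s in elements]
print(stripped)  # ['Strings', 'StringS', 'StringS']
```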
- cached_regex_patterns() List#
Returns the regex patterns for which Match objects have been cached
- purge_cached_regex_patterns() None#
Purges cached regex patterns
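As an analogy for what these two methods expose, a minimal compile-once pattern cache with list and purge operations can be sketched in plain Python. This is illustrative only; it is not arkouda's implementation, whose client-side cache also tracks server-side Match state.

```python
import re

_pattern_cache = {}

def cached_compile(pattern):
    """Compile a pattern once and reuse it on later calls."""
    if pattern not in _pattern_cache:
        _pattern_cache[pattern] = re.compile(pattern)
    return _pattern_cache[pattern]

def cached_patterns():
    """List the patterns currently cached (cf. cached_regex_patterns)."""
    return list(_pattern_cache)

def purge_cached_patterns():
    """Drop all cached patterns (cf. purge_cached_regex_patterns)."""
    _pattern_cache.clear()

cached_compile(r"\d+")
print(cached_patterns())  # ['\\d+']
purge_cached_patterns()
print(cached_patterns())  # []
```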
- find_locations(pattern: bytes | arkouda.dtypes.str_scalars) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Finds pattern matches and returns pdarrays containing the number, start positions, and lengths of matches
- Parameters:
pattern (str_scalars) – The regex pattern used to find matches
- Returns:
pdarray, int64 – For each original string, the number of pattern matches
pdarray, int64 – The start positions of pattern matches
pdarray, int64 – The lengths of pattern matches
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)]) >>> num_matches, starts, lens = strings.find_locations('\d') >>> num_matches array([2, 2, 2, 2, 2]) >>> starts array([0, 9, 0, 9, 0, 9, 0, 9, 0, 9]) >>> lens array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
- search(pattern: bytes | arkouda.dtypes.str_scalars) arkouda.match.Match#
Returns a match object with the first location in each element where pattern produces a match. Elements match if any part of the string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match if any part of the string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', '']) >>> strings.search('_+') <ak.Match object: matched=True, span=(1, 2); matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
- match(pattern: bytes | arkouda.dtypes.str_scalars) arkouda.match.Match#
Returns a match object where elements match only if the beginning of the string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match only if the beginning of the string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', '']) >>> strings.match('_+') <ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
- fullmatch(pattern: bytes | arkouda.dtypes.str_scalars) arkouda.match.Match#
Returns a match object where elements match only if the whole string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match only if the whole string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', '']) >>> strings.fullmatch('_+') <ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=False; matched=False>
- split(pattern: bytes | arkouda.dtypes.str_scalars, maxsplit: int = 0, return_segments: bool = False) Strings | Tuple#
Returns a new Strings split by the occurrences of pattern. If maxsplit is nonzero, at most maxsplit splits occur
- Parameters:
pattern (str) – Regex used to split strings into substrings
maxsplit (int) – The max number of pattern match occurrences in each element to split. The default maxsplit=0 splits on all occurrences
return_segments (bool) – If True, return mapping of original strings to first substring in return array.
- Returns:
Strings – Substrings with pattern matches removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', '']) >>> strings.split('_+', maxsplit=2, return_segments=True) (array(['1', '2', '', '', '', '3', '', '4', '5____6___7', '']), array([0 3 5 6 9]))
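Per element, split follows the same pattern and maxsplit semantics as Python's re.split, so the substrings in the example above can be reproduced locally (the optional segments mapping is computed server-side and is not shown here):

```python
import re

# Split each element of the example locally with the same pattern
# and maxsplit as the arkouda example above.
elements = ['1_2___', '____', '3', '__4___5____6___7', '']
pieces = [re.split('_+', s, maxsplit=2) for s in elements]
flat = [piece for ps in pieces for piece in ps]
print(flat)
# ['1', '2', '', '', '', '3', '', '4', '5____6___7', '']
```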
- findall(pattern: bytes | arkouda.dtypes.str_scalars, return_match_origins: bool = False) Strings | Tuple#
Return a new Strings containing all non-overlapping matches of pattern
- Parameters:
pattern (str_scalars) – Regex used to find matches
return_match_origins (bool) – If True, return a pdarray containing the index of the original string each pattern match is from
- Returns:
Strings – Strings object containing only pattern matches
pdarray, int64 (optional) – The index of the original string each pattern match is from
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', '']) >>> strings.findall('_+', return_match_origins=True) (array(['_', '___', '____', '__', '___', '____', '___']), array([0 0 1 3 3 3 3]))
- sub(pattern: bytes | arkouda.dtypes.str_scalars, repl: bytes | arkouda.dtypes.str_scalars, count: int = 0) Strings#
Return new Strings obtained by replacing non-overlapping occurrences of pattern with the replacement repl. If count is nonzero, at most count substitutions occur
- Parameters:
pattern (str_scalars) – The regex to substitute
repl (str_scalars) – The substring to replace pattern matches with
count (int) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Returns:
Strings with pattern matches replaced
- Return type:
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', '']) >>> strings.sub(pattern='_+', repl='-', count=2) array(['1-2-', '-', '3', '-4-5____6___7', ''])
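Element-wise, sub mirrors Python's re.sub with the same count semantics; the example above can be checked locally:

```python
import re

# Apply the same substitution per element as the arkouda example above.
elements = ['1_2___', '____', '3', '__4___5____6___7', '']
replaced = [re.sub('_+', '-', s, count=2) for s in elements]
print(replaced)  # ['1-2-', '-', '3', '-4-5____6___7', '']
```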
- subn(pattern: bytes | arkouda.dtypes.str_scalars, repl: bytes | arkouda.dtypes.str_scalars, count: int = 0) Tuple#
Perform the same operation as sub(), but return a tuple (new_Strings, number_of_substitutions)
- Parameters:
pattern (str_scalars) – The regex to substitute
repl (str_scalars) – The substring to replace pattern matches with
count (int) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Returns:
Strings – Strings with pattern matches replaced
pdarray, int64 – The number of substitutions made for each element of Strings
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.subn(pattern='_+', repl='-', count=2)
(array(['1-2-', '-', '3', '-4-5____6___7', '']), array([2 1 0 2 0]))
- contains(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element contains the given substring.
- Parameters:
substr (str_scalars) – The substring in the form of string or byte array to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that contain substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> strings
array(['1 string 1', '2 string 2', '3 string 3', '4 string 4', '5 string 5'])
>>> strings.contains('string')
array([True, True, True, True, True])
>>> strings.contains('string \d', regex=True)
array([True, True, True, True, True])
- startswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element starts with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The prefix to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that start with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.startswith('string')
array([True, True, True, True, True])
>>> strings_start = ak.array([f'{i} string' for i in range(1, 6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.startswith('\d str', regex=True)
array([True, True, True, True, True])
- endswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element ends with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The suffix to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that end with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings_start = ak.array([f'{i} string' for i in range(1, 6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.endswith('ing')
array([True, True, True, True, True])
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.endswith('ing \d', regex=True)
array([True, True, True, True, True])
- flatten(delimiter: str, return_segments: bool = False, regex: bool = False) Strings | Tuple#
Unpack delimiter-joined substrings into a flat array.
- Parameters:
delimiter (str) – Characters used to split strings into substrings
return_segments (bool) – If True, also return mapping of original strings to first substring in return array.
regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
Strings – Flattened substrings with delimiters removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array
Examples
>>> orig = ak.array(['one|two', 'three|four|five', 'six'])
>>> orig.flatten('|')
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> flat, map = orig.flatten('|', return_segments=True)
>>> map
array([0, 2, 5])
>>> under = ak.array(['one_two', 'three_____four____five', 'six'])
>>> under_flat, under_map = under.flatten('_+', return_segments=True, regex=True)
>>> under_flat
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> under_map
array([0, 2, 5])
- peel(delimiter: bytes | arkouda.dtypes.str_scalars, times: arkouda.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, fromRight: bool = False, regex: bool = False) Tuple#
Peel off one or more delimited fields from each string (similar to string.partition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the first (times-1) delimiters
includeDelimiter (bool) – If true, append the delimiter to the end of the first return array. By default, it is prepended to the beginning of the second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the first array. By default, such strings are returned in the second array.
fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)
regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
- left: Strings
The field(s) peeled from the end of each string (unless fromRight is true)
- right: Strings
The remainder of each string after peeling (unless fromRight is true)
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars, if times is not int64, or if includeDelimiter, keepPartial, or fromRight is not bool
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
>>> s.peel('.', includeDelimiter=True)
(array(['a.', 'c.', 'e.']), array(['b', 'd', 'f.g']))
>>> s.peel('.', times=2)
(array(['', '', 'e.f']), array(['a.b', 'c.d', 'g']))
>>> s.peel('.', times=2, keepPartial=True)
(array(['a.b', 'c.d', 'e.f']), array(['', '', 'g']))
- rpeel(delimiter: bytes | arkouda.dtypes.str_scalars, times: arkouda.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, regex: bool = False)#
Peel off one or more delimited fields from the end of each string (similar to string.rpartition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the last (times-1) delimiters
includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return array. By default, it is appended to the end of the second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the second array. By default, such strings are returned in the first array.
regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
- left: Strings
The remainder of the string after peeling
- right: Strings
The field(s) that were peeled from the right of each string
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if times is not int64
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.rpeel('.')
(array(['a', 'c', 'e.f']), array(['b', 'd', 'g']))
# Compared against peel
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
- stick(other: Strings, delimiter: bytes | arkouda.dtypes.str_scalars = '', toLeft: bool = False) Strings#
Join the strings from another array onto one end of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (str) – String inserted between self and other
toLeft (bool) – If true, join other strings to the left of self. By default, other is joined to the right of self.
- Returns:
The array of joined strings
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if the other parameter is not a Strings instance
ValueError – Raised if times is < 1
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.stick(t, delimiter='.')
array(['a.b', 'c.d', 'e.f'])
- lstick(other: Strings, delimiter: bytes | arkouda.dtypes.str_scalars = '') Strings#
Join the strings from another array onto the left of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (Union[bytes,str_scalars]) – String inserted between self and other
- Returns:
The array of joined strings, as other + self
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is neither bytes nor a str or if the other parameter is not a Strings instance
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.lstick(t, delimiter='.')
array(['b.a', 'd.c', 'f.e'])
- get_prefixes(n: arkouda.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray]#
Return the n-long prefix of each string, where possible
- Parameters:
n (int) – Length of prefix
return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-prefix
proper (bool) – If True, only return proper prefixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a prefix.
- Returns:
prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character prefix, False otherwise.
- get_suffixes(n: arkouda.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray]#
Return the n-long suffix of each string, where possible
- Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-suffix
proper (bool) – If True, only return proper suffixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a suffix.
- Returns:
suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character suffix, False otherwise.
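The documented proper-prefix/suffix semantics can be illustrated with plain Python strings (a sketch only; the real get_prefixes/get_suffixes run server-side on a Strings object):

```python
# Pure-Python sketch of get_prefixes / get_suffixes with proper=True:
# only strings of length >= n + 1 contribute an n-character affix.
strings = ["hello", "my", "world"]
n = 3

origin = [len(s) >= n + 1 for s in strings]              # origin_indices analogue
prefixes = [s[:n] for s, keep in zip(strings, origin) if keep]
suffixes = [s[-n:] for s, keep in zip(strings, origin) if keep]

print(prefixes)  # ['hel', 'wor']
print(suffixes)  # ['llo', 'rld']
print(origin)    # [True, False, True]
```

Note that the number of returned affixes equals the number of True values in the mask, as stated above.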
- hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Compute a 128-bit hash of each string.
- Returns:
A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.
- Return type:
Notes
The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.
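The shape of the return value can be illustrated in plain Python. Note the stand-in below uses blake2b (available in hashlib) rather than arkouda's SipHash128; it only shows how a 128-bit digest per string is carried as a pair of 64-bit integers, matching the two int64 pdarrays returned:

```python
import hashlib

# Sketch only: blake2b with a 16-byte digest stands in for SipHash128.
# Each string's 128-bit hash is split into two 64-bit halves.
strings = ["hello", "my", "world"]
upper, lower = [], []
for s in strings:
    digest = hashlib.blake2b(s.encode(), digest_size=16).digest()
    upper.append(int.from_bytes(digest[:8], "big"))
    lower.append(int.from_bytes(digest[8:], "big"))

# The i-th hash value is the concatenation of upper[i] and lower[i].
print(len(upper), len(lower))  # 3 3
```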
- group() arkouda.pdarrayclass.pdarray#
Return the permutation that groups the array, placing equivalent strings together. All instances of the same string are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.
- Returns:
The permutation that groups the array by value
- Return type:
Notes
If the arkouda server is compiled with “-sSegmentedString.useHash=true”, then arkouda uses 128-bit hash values to group strings, rather than sorting the strings directly. This method is fast, but the resulting permutation merely groups equivalent strings and does not sort them. If the “useHash” parameter is false, then a full sort is performed.
- Raises:
RuntimeError – Raised if there is a server-side error in executing group request or creating the pdarray encapsulating the return message
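The guarantee group() makes can be sketched with plain Python lists: the returned permutation places equal strings in contiguous blocks. A full argsort, as below, is one valid such permutation (the hashed variant would produce a different but equally valid grouping):

```python
# Sketch of the group() guarantee: after applying the permutation,
# all instances of the same string lie in one contiguous block.
strings = ["b", "a", "b", "c", "a"]
perm = sorted(range(len(strings)), key=lambda i: strings[i])
grouped = [strings[i] for i in perm]

print(grouped)  # ['a', 'a', 'b', 'b', 'c'] -- equal values are contiguous
```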
- to_ndarray() numpy.ndarray#
Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same strings as this array
- Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
numpy.ndarray
- to_list() list#
Convert the SegString to a list, transferring data from the arkouda server to Python. If the SegString exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A list with the same strings as this SegString
- Return type:
list
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_list()
['hello', 'my', 'world']
>>> type(a.to_list())
list
- astype(dtype) arkouda.pdarrayclass.pdarray#
Cast values of Strings object to provided dtype
- Parameters:
dtype (np.dtype or str) – Dtype to cast to
- Returns:
An arkouda pdarray with values converted to the specified data type
- Return type:
ak.pdarray
Notes
This is essentially shorthand for ak.cast(x, '<dtype>') where x is a pdarray.
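The element-wise effect of casting a Strings object of numeric text can be sketched with a plain Python loop (illustration only; the real astype/ak.cast run server-side):

```python
# Sketch of what astype does element-wise: each string is parsed as the
# target dtype, e.g. int64 or float64.
values = ["1", "2", "3"]
as_int64 = [int(v) for v in values]      # analogue of strings.astype(ak.int64)
as_float64 = [float(v) for v in values]  # analogue of strings.astype(ak.float64)

print(as_int64)    # [1, 2, 3]
print(as_float64)  # [1.0, 2.0, 3.0]
```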
- to_parquet(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', compression: str | None = None) str#
Save the Strings object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If 'append', attempt to create a new dataset in existing files.
compression (str (Optional)) – (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4") Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'.
'append' write mode is supported, but is not efficient.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- to_hdf(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, file_type: str = 'distribute') str#
Save the Strings object to HDF5. The object can be saved to a collection of files or single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – The name of the Strings dataset to be written, defaults to strings_array
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read.
file_type (str ("single" | "distribute")) – Default: 'distribute'. Distribute the dataset over a file per locale; 'single' saves the dataset to one file.
- Return type:
String message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
Parquet files do not store the segments, only the values.
Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string
the hdf5 group is named via the dataset parameter.
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'. Otherwise, the file name will be prefix_path.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
See also
- update_hdf(prefix_path: str, dataset: str = 'strings_array', save_offsets: bool = True, repack: bool = True)#
Overwrite the dataset with the name provided with this Strings object. If the dataset does not exist it is added
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read.
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the Strings object
Notes
If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
- to_csv(prefix_path: str, dataset: str = 'strings_array', col_delim: str = ',', overwrite: bool = False)#
Write Strings to CSV file(s). File will contain a single column with the Strings data. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).
- Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
dataset (str) – Column name to save the Strings under. Defaults to “strings_array”.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
str – response message
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
- save(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str#
DEPRECATED Save the Strings object to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 supports single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – The name of the Strings dataset to be written, defaults to strings_array
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If 'append', create a new Strings dataset within existing files.
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read. This is not supported for Parquet files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
file_format (str) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.
file_type (str ("single" | "distribute")) – Default: Distribute Distribute the dataset over a file per locale. Single file will save the dataset to one file
- Return type:
String message indicating result of save operation
Notes
Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter. (3) Parquet files do not store the segments, only the values.
- info() str#
Returns a JSON formatted string containing information about all components of self
- Parameters:
None –
- Returns:
JSON string containing information about all components of self
- Return type:
str
- pretty_print_info() None#
Prints information about all components of self in a human readable format
- Parameters:
None –
- Return type:
None
- register(user_defined_name: str) Strings#
Register this Strings object with a user defined name in the arkouda server so it can be attached to later using Strings.attach(). This is an in-place operation; registering a Strings object more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one object at a time.
- Parameters:
user_defined_name (str) – user defined name which the Strings object is to be registered under
- Returns:
The same Strings object which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different objects with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Strings object with the user_defined_name If the user is attempting to register more than one object with the same name, the former should be unregistered first to free up the registration name.
See also
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- unregister() None#
Unregister a Strings object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not find the internal name/symbol to remove
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry
- Parameters:
None –
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
- static attach(user_defined_name: str) Strings#
class method to return a Strings object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which the Strings object was registered under
- Returns:
the Strings object registered with user_defined_name in the arkouda server
- Return type:
Strings object
- Raises:
TypeError – Raised if user_defined_name is not a str
See also
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- static unregister_strings_by_name(user_defined_name: str) None#
Unregister a Strings object in the arkouda server previously registered via register()
- Parameters:
user_defined_name (str) – The registered name of the Strings object
See also
- transfer(hostname: str, port: arkouda.dtypes.int_scalars)#
Sends a Strings object to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the Strings object is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports in succession, using ports in the range {port..(port+numLocales)} (e.g., when running an Arkouda server on 4 nodes and port 1234 is passed, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- arkouda.SEG_SUFFIX = '_segments'#
- arkouda.VAL_SUFFIX = '_values'#
- arkouda.LEN_SUFFIX = '_lengths'#
- arkouda.segarray(segments: arkouda.pdarrayclass.pdarray, values: arkouda.pdarrayclass.pdarray, lengths=None, grouping=None)#
DEPRECATED Alias for the from_parts function; prevents the user from needing to call the ak.SegArray constructor directly.
- class arkouda.SegArray(segments, values, lengths=None, grouping=None)#
- property non_empty#
- property grouping#
- objType = 'SegArray'#
- classmethod from_parts(segments, values, lengths=None, grouping=None) SegArray#
DEPRECATED Construct a SegArray object from its parts
- Parameters:
- Returns:
Data structure representing an array whose elements are variable-length arrays.
- Return type:
Notes
Keyword args ‘lengths’ and ‘grouping’ are not user-facing. They are used by the attach method.
- classmethod from_multi_array(m)#
Construct a SegArray from a list of columns. This essentially transposes the input, resulting in an array of rows.
- classmethod concat(x, axis=0, ordered=True)#
Concatenate a sequence of SegArrays
- Parameters:
x (sequence of SegArray) – The SegArrays to concatenate
axis (0 or 1) – Select vertical (0) or horizontal (1) concatenation. If axis=1, all SegArrays must have same size.
ordered (bool) – Must be True. This option is present for compatibility only, because unordered concatenation is not yet supported.
- Returns:
The input arrays joined into one SegArray
- Return type:
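The two concatenation axes can be sketched with nested Python lists standing in for SegArrays (illustration only): axis=0 stacks sub-arrays end to end, while axis=1 joins the i-th sub-arrays element-wise and so requires equal sizes.

```python
# Sketch of SegArray.concat semantics using nested lists as stand-ins.
a = [[1, 2], [3]]
b = [[4], [5, 6]]

axis0 = a + b                          # vertical: [[1, 2], [3], [4], [5, 6]]
axis1 = [x + y for x, y in zip(a, b)]  # horizontal: [[1, 2, 4], [3, 5, 6]]

print(axis0)
print(axis1)
```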
- copy()#
Return a deep copy.
- get_suffixes(n, return_origins=True, proper=True)#
Return the n-long suffix of each sub-array, where possible
- Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which sub-arrays were long enough to return an n-suffix
proper (bool) – If True, only return proper suffixes, i.e. from sub-arrays that are at least n+1 long. If False, allow the entire sub-array to be returned as a suffix.
- Returns:
suffixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-suffix. The number of rows is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return an n-suffix, False otherwise.
- get_prefixes(n, return_origins=True, proper=True)#
Return all sub-array prefixes of length n (for sub-arrays that are at least n+1 long)
- Parameters:
n (int) – Length of prefix
return_origins (bool) – If True, return a logical index indicating which sub-arrays were long enough to return an n-prefix
proper (bool) – If True, only return proper prefixes, i.e. from sub-arrays that are at least n+1 long. If False, allow the entire sub-array to be returned as a prefix.
- Returns:
prefixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-prefix. The number of rows is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return an n-prefix, False otherwise.
- get_ngrams(n, return_origins=True)#
Return all n-grams from all sub-arrays.
- Parameters:
n (int) – Length of n-gram
return_origins (bool) – If True, return an int64 array indicating which sub-array each returned n-gram came from.
- Returns:
ngrams (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-gram.
origin_indices (pdarray, int) – The index of the sub-array from which the corresponding n-gram originated
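The n-gram extraction can be sketched with nested Python lists (a row-oriented illustration; the real method returns the n-grams columnar, as an n-long list of pdarrays):

```python
# Sketch of get_ngrams semantics: every length-n window from every
# sub-array, plus the index of the sub-array it came from.
segarr = [[1, 2, 3], [4, 5], [6]]
n = 2

ngrams, origins = [], []
for idx, sub in enumerate(segarr):
    for start in range(len(sub) - n + 1):
        ngrams.append(tuple(sub[start:start + n]))
        origins.append(idx)

print(ngrams)   # [(1, 2), (2, 3), (4, 5)]
print(origins)  # [0, 0, 1]
```

Sub-arrays shorter than n (like [6] above) contribute no n-grams.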
- get_jth(j, return_origins=True, compressed=False, default=0)#
Select the j-th element of each sub-array, where possible.
- Parameters:
j (int) – The index of the value to get from each sub-array. If j is negative, it counts backwards from the end of each sub-array.
return_origins (bool) – If True, return a logical index indicating where j is in bounds
compressed (bool) – If False, return array is same size as self, with default value where j is out of bounds. If True, the return array only contains values where j is in bounds.
default (scalar) – When compressed=False, the value to return when j is out of bounds for the sub-array
- Returns:
val (pdarray) – compressed=False: The j-th value of each sub-array where j is in bounds and the default value where j is out of bounds. compressed=True: The j-th values of only the sub-arrays where j is in bounds
origin_indices (pdarray, bool) – A Boolean array that is True where j is in bounds for the sub-array.
Notes
If values are Strings, only the compressed format is supported.
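The compressed versus uncompressed behavior can be sketched in plain Python; the function name is hypothetical and the real method operates on server-side pdarrays:

```python
def get_jth_sketch(subarrays, j, compressed=False, default=0):
    """Pure-Python sketch of the SegArray.get_jth semantics."""
    # j is in bounds if it is a valid (possibly negative) index for the sub-array.
    in_bounds = [-len(sa) <= j < len(sa) for sa in subarrays]
    if compressed:
        vals = [sa[j] for sa, ok in zip(subarrays, in_bounds) if ok]
    else:
        vals = [sa[j] if ok else default for sa, ok in zip(subarrays, in_bounds)]
    return vals, in_bounds

vals, mask = get_jth_sketch([[1, 2, 3], [4], [5, 6]], j=1)
# vals -> [2, 0, 6]; mask -> [True, False, True]
```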
- set_jth(i, j, v)#
Set the j-th element of each sub-array in a subset.
- Parameters:
- Raises:
ValueError – If j is out of bounds in any of the sub-arrays specified by i.
- get_length_n(n, return_origins=True)#
Return all sub-arrays of length n, as a list of columns.
- Parameters:
n (int) – Length of sub-arrays to select
return_origins (bool) – Return a logical index indicating which sub-arrays are length n
- Returns:
columns (list of pdarray) – An n-long list of pdarray, where each row is one of the n-long sub-arrays from the SegArray. The number of rows is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Array of bool for each element of the SegArray, True where sub-array has length n.
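The column-oriented return layout can be illustrated with a pure-Python sketch (hypothetical function name; the real method returns a list of pdarrays):

```python
def get_length_n_sketch(subarrays, n):
    """Pure-Python sketch of the SegArray.get_length_n semantics."""
    mask = [len(sa) == n for sa in subarrays]
    selected = [sa for sa, ok in zip(subarrays, mask) if ok]
    # Transpose the selected rows into n columns, mirroring the
    # n-long list of pdarrays the real method returns.
    columns = [[row[k] for row in selected] for k in range(n)]
    return columns, mask

cols, mask = get_length_n_sketch([[1, 2], [3], [4, 5]], n=2)
# cols -> [[1, 4], [2, 5]]; mask -> [True, False, True]
```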
- append(other, axis=0)#
Append other to self, either vertically (axis=0, length of resulting SegArray increases), or horizontally (axis=1, each sub-array of other appends to the corresponding sub-array of self).
- Parameters:
other (SegArray) – Array of sub-arrays to append
axis (0 or 1) – Whether to append vertically (0) or horizontally (1). If axis=1, other must be same size as self.
- Returns:
axis=0: New SegArray containing all sub-arrays
axis=1: New SegArray of same length, with pairs of sub-arrays concatenated
- Return type:
- append_single(x, prepend=False)#
Append a single value to each sub-array.
- prepend_single(x)#
- remove_repeats(return_multiplicity=False)#
Condense sequences of repeated values within a sub-array to a single value.
- Parameters:
return_multiplicity (bool) – If True, also return the number of times each value was repeated.
- Returns:
norepeats (SegArray) – Sub-arrays with runs of repeated values replaced with single value
multiplicity (SegArray) – If return_multiplicity=True, this array contains the number of times each value in the returned SegArray was repeated in the original SegArray.
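The condensing behavior is run-length encoding per sub-array, which can be sketched in plain Python (hypothetical function name, no server needed):

```python
from itertools import groupby

def remove_repeats_sketch(subarrays, return_multiplicity=False):
    """Pure-Python sketch of the SegArray.remove_repeats semantics."""
    condensed, counts = [], []
    for sa in subarrays:
        # groupby collapses each run of equal adjacent values into one group.
        runs = [(val, len(list(grp))) for val, grp in groupby(sa)]
        condensed.append([val for val, _ in runs])
        counts.append([cnt for _, cnt in runs])
    return (condensed, counts) if return_multiplicity else condensed

out, mult = remove_repeats_sketch([[1, 1, 2, 2, 2, 3]], return_multiplicity=True)
# out -> [[1, 2, 3]]; mult -> [[2, 3, 1]]
```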
- to_ndarray()#
Convert the array into a numpy.ndarray containing sub-arrays
- Returns:
A numpy ndarray with the same sub-arrays (also numpy.ndarray) as this array
- Return type:
np.ndarray
Examples
>>> segarr = ak.SegArray(ak.array([0, 4, 7]), ak.arange(12))
>>> segarr.to_ndarray()
array([array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8, 9, 10, 11])])
>>> type(segarr.to_ndarray())
numpy.ndarray
- to_list()#
Convert the segarray into a list containing sub-arrays
- Returns:
A list with the same sub-arrays (also list) as this segarray
- Return type:
list
See also
Examples
>>> segarr = ak.SegArray(ak.array([0, 4, 7]), ak.arange(12))
>>> segarr.to_list()
[[0, 1, 2, 3], [4, 5, 6], [7, 8, 9, 10, 11]]
>>> type(segarr.to_list())
list
- sum(x=None)#
- prod(x=None)#
- min(x=None)#
- max(x=None)#
- argmin(x=None)#
- argmax(x=None)#
- any(x=None)#
- all(x=None)#
- OR(x=None)#
- AND(x=None)#
- XOR(x=None)#
- nunique(x=None)#
- mean(x=None)#
- aggregate(op, x=None)#
- unique(x=None)#
Return sub-arrays of unique values.
- hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Compute a 128-bit hash of each segment.
- to_hdf(prefix_path, dataset='segarray', mode='truncate', file_type='distribute')#
Save the SegArray to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: "distribute". When set to "single", the dataset is written to a single file. When "distribute", the dataset is written to one file per locale. This is only supported by HDF5 and has no impact on Parquet files.
- Return type:
None
See also
- update_hdf(prefix_path: str, dataset: str = 'segarray', repack: bool = True)#
Overwrite the dataset with the name provided with this SegArray object. If the dataset does not exist it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
None
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the SegArray
Notes
If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added
Because HDF5 deletes do not release memory, this will create a copy of the file with the new data
- to_parquet(prefix_path, dataset='segarray', mode: str = 'truncate', compression: str | None = None)#
Save the SegArray object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the object to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str) – Deprecated; kept to maintain the signature of similar calls. Only 'truncate' is supported: by default, output files are truncated (overwritten) if they exist.
compression (str, optional) – (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4") Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – If write mode is not Truncate.
Notes
Append mode for Parquet has been deprecated. It was not implemented for SegArray.
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- save(prefix_path, dataset='segarray', mode='truncate', file_type='distribute')#
DEPRECATED Save the SegArray to HDF5. The object can be saved to a collection of files or a single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files if they exist. If 'append', attempt to create a new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: "distribute". When set to "single", the dataset is written to a single file. When "distribute", the dataset is written to one file per locale. This is only supported by HDF5 and has no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'. Otherwise, the file name will be prefix_path.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- classmethod read_hdf(prefix_path, dataset='segarray')#
Load a saved SegArray from HDF5. All arguments must match what was supplied to SegArray.save()
- Parameters:
prefix_path (str) – Directory and filename prefix
dataset (str) – Name prefix for saved data within the HDF5 files
- Return type:
- classmethod load(prefix_path, dataset='segarray', segment_name='segments', value_name='values')#
- intersect(other)#
Computes the intersection of 2 SegArrays.
- Parameters:
other (SegArray) – SegArray to compute against
- Returns:
Segments are the 1d intersections of the segments of self and other
- Return type:
See also
Examples
>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.intersect(seg_b)
SegArray([
[1, 3],
[4]
])
- union(other)#
Computes the union of 2 SegArrays.
- Parameters:
other (SegArray) – SegArray to compute against
- Returns:
Segments are the 1d union of the segments of self and other
- Return type:
See also
Examples
>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.union(seg_b)
SegArray([
[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5]
])
- setdiff(other)#
Computes the set difference of 2 SegArrays.
- Parameters:
other (SegArray) – SegArray to compute against
- Returns:
Segments are the 1d set difference of the segments of self and other
- Return type:
See also
Examples
>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.setdiff(seg_b)
SegArray([
[2, 4],
[1, 3, 5]
])
- setxor(other)#
Computes the symmetric difference of 2 SegArrays.
- Parameters:
other (SegArray) – SegArray to compute against
- Returns:
Segments are the 1d symmetric difference of the segments of self and other
- Return type:
See also
Examples
>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.setxor(seg_b)
SegArray([
[2, 4, 5],
[1, 3, 5, 2]
])
- filter(filter, discard_empty: bool = False)#
Filter values out of the SegArray object
- register(user_defined_name)#
Register this SegArray object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name which this SegArray object will be registered under
- Returns:
The same SegArray which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different SegArrays with the same name.
- Return type:
- Raises:
RegistrationError – Raised if the server could not register the SegArray object
Notes
Objects registered with the server are immune to deletion until they are unregistered.
See also
- unregister()#
Unregister this SegArray object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not unregister the SegArray object from the Symbol Table
Notes
Objects registered with the server are immune to deletion until they are unregistered.
See also
- static unregister_segarray_by_name(user_defined_name)#
Using the defined name, remove the registered SegArray object from the Symbol Table
- Parameters:
user_defined_name (str) – user defined name which the SegArray object was registered under
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not unregister the SegArray object from the Symbol Table
See also
- classmethod attach(user_defined_name)#
Using the defined name, attach to a SegArray that has been registered to the Symbol Table
- Parameters:
user_defined_name (str) – user defined name which the SegArray object was registered under
- Returns:
The resulting SegArray
- Return type:
- Raises:
RuntimeError – Raised if the server could not attach to the SegArray object
See also
- is_registered() bool#
Checks if the name of the SegArray object is registered in the Symbol Table
- Returns:
True if SegArray is registered, false if not
- Return type:
bool
See also
- transfer(hostname: str, port: arkouda.dtypes.int_scalars)#
Sends a Segmented Array to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the Segmented Array is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). Arkouda will use numLocales ports in succession, i.e. the range {port..(port+numLocales-1)} (e.g., for a 4-node Arkouda server with port=1234, ports 1234, 1235, 1236, and 1237 are used to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- class arkouda.DataFrame(initialdata=None, index=None)#
Bases:
collections.UserDict
A DataFrame structure based on arkouda arrays.
Examples
Create an empty DataFrame and add a column of data:
>>> import arkouda as ak
>>> import numpy as np
>>> import pandas as pd
>>> df = ak.DataFrame()
>>> df['a'] = ak.array([1,2,3])
Create a new DataFrame using a dictionary of data:
>>> userName = ak.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
>>> userID = ak.array([111, 222, 111, 333, 222, 111])
>>> item = ak.array([0, 0, 1, 1, 2, 0])
>>> day = ak.array([5, 5, 6, 5, 6, 6])
>>> amount = ak.array([0.5, 0.6, 1.1, 1.2, 4.3, 0.6])
>>> df = ak.DataFrame({'userName': userName, 'userID': userID,
...                    'item': item, 'day': day, 'amount': amount})
>>> df
DataFrame(['userName', 'userID', 'item', 'day', 'amount'] [6 rows : 224 B])
Indexing works slightly differently than with pandas:
>>> df[0]
{'userName': 'Alice', 'userID': 111, 'item': 0, 'day': 5, 'amount': 0.5}
>>> df['userID']
array([111, 222, 111, 333, 222, 111])
>>> df['userName']
array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
>>> df[[1,5,7]]
  userName  userID  item  day  amount
1      Bob     222     0    5     0.6
2    Alice     111     1    6     1.1
3    Carol     333     1    5     1.2
Note that strides are not implemented except for stride = 1.
>>> df[1:5:1]
DataFrame(['userName', 'userID', 'item', 'day', 'amount'] [4 rows : 148 B])
>>> df[ak.array([1,2,3])]
DataFrame(['userName', 'userID', 'item', 'day', 'amount'] [3 rows : 112 B])
>>> df[['userID', 'day']]
DataFrame(['userID', 'day'] [6 rows : 96 B])
- property size#
Returns the number of bytes on the arkouda server.
- property dtypes#
- property empty#
- property shape#
- property columns#
- property index#
- property info#
Returns a summary string of this dataframe.
- COLUMN_CLASSES = ()#
- objType = 'DataFrame'#
- transfer(hostname, port)#
Sends a DataFrame to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the DataFrame is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). Arkouda will use numLocales ports in succession, i.e. the range {port..(port+numLocales-1)} (e.g., for a 4-node Arkouda server with port=1234, ports 1234, 1235, 1236, and 1237 are used to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- classmethod from_pandas(pd_df)#
- drop(keys: str | int | List[str | int], axis: str | int = 0, inplace: bool = False) None | DataFrame#
Drop column/s or row/s from the dataframe.
- Parameters:
keys (str, int or list) – The labels to be dropped on the given axis
axis (int or str) – The axis on which to drop from. 0/’index’ - drop rows, 1/’columns’ - drop columns
inplace (bool) – Default False. When True, perform the operation on the calling object. When False, return a new object.
- Returns:
DataFrame when inplace=False
None when inplace=True
Examples
Drop a column:
>>> df.drop('col_name', axis=1)
Drop a row:
>>> df.drop(1)
or
>>> df.drop(1, axis=0)
- drop_duplicates(subset=None, keep='first')#
Drops duplicated rows and returns the resulting DataFrame.
If a subset of the columns are provided then only one instance of each duplicated row will be returned (keep determines which row).
- Parameters:
subset (Iterable of column names to use to dedupe.) –
keep ({'first', 'last'}, default 'first') – Determines which duplicates (if any) to keep.
- Returns:
DataFrame with duplicates removed.
- Return type:
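The keep-first/keep-last deduplication semantics can be sketched in plain Python over row dictionaries; the function name is hypothetical and the real method operates column-wise on server-side arrays:

```python
def drop_duplicates_sketch(rows, subset=None, keep='first'):
    """Pure-Python sketch of the DataFrame.drop_duplicates semantics."""
    # Rows compare equal on the subset columns (or all columns if subset is None).
    key = lambda row: tuple(row[c] for c in (subset or sorted(row)))
    seen = {}
    order = rows if keep == 'first' else list(reversed(rows))
    for row in order:
        seen.setdefault(key(row), row)  # first row seen in iteration order wins
    kept = list(seen.values())
    return kept if keep == 'first' else kept[::-1]

rows = [{'a': 1, 'b': 2}, {'a': 1, 'b': 3}, {'a': 1, 'b': 2}]
drop_duplicates_sketch(rows)                # keeps the first two distinct rows
drop_duplicates_sketch(rows, subset=['a'])  # keeps only one row per value of 'a'
```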
- reset_index(size: bool = False, inplace: bool = False) None | DataFrame#
Set the index to an integer range.
Useful if this dataframe is the result of a slice operation from another dataframe, or if you have permuted the rows and no longer need to keep that ordering on the rows.
- Parameters:
size (int) – If size is passed, do not attempt to determine size based on existing column sizes. Assume caller handles consistency correctly.
inplace (bool) – Default False. When True, perform the operation on the calling object. When False, return a new object.
- Returns:
DataFrame when inplace=False
None when inplace=True
Note
Pandas adds a column ‘index’ to indicate the original index. Arkouda does not currently support this behavior.
- update_size()#
Computes the number of bytes on the arkouda server.
- rename(mapper: Callable | Dict | None = None, index: Callable | Dict | None = None, column: Callable | Dict | None = None, axis: str | int = 0, inplace: bool = False) DataFrame | None#
Rename indexes or columns according to a mapping.
- Parameters:
mapper (callable or dict-like, Optional) – Function or dictionary mapping existing values to new values. Nonexistent names will not raise an error. Uses the value of axis to determine if renaming column or index
column (callable or dict-like, Optional) – Function or dictionary mapping existing column names to new column names. Nonexistent names will not raise an error. When this is set, axis is ignored.
index (callable or dict-like, Optional) – Function or dictionary mapping existing index names to new index names. Nonexistent names will not raise an error. When this is set, axis is ignored
axis (int or str) – Default 0. Indicates which axis to perform the rename. 0/”index” - Indexes 1/”column” - Columns
inplace (bool) – Default False. When True, perform the operation on the calling object. When False, return a new object.
- Returns:
DataFrame when inplace=False
None when inplace=True
Examples
>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})

Rename columns using a mapping:
>>> df.rename(column={'A':'a', 'B':'c'})
   a  c
0  1  4
1  2  5
2  3  6

Rename indexes using a mapping:
>>> df.rename(index={0:99, 2:11})
    A  B
99  1  4
1   2  5
11  3  6

Rename using an axis style parameter:
>>> df.rename(str.lower, axis='column')
   a  b
0  1  4
1  2  5
2  3  6
- append(other, ordered=True)#
Concatenate data from ‘other’ onto the end of this DataFrame, in place.
Explicitly, use the arkouda concatenate function to append the data from each column in other to the end of self. This operation is done in place, in the sense that the underlying pdarrays are updated from the result of the arkouda concatenate function, rather than returning a new DataFrame object containing the result.
- Parameters:
other (DataFrame) – The DataFrame object whose data will be appended to this DataFrame.
ordered (bool) – If False, allow rows to be interleaved for better performance (but data within a row remains together). By default, append all rows to the end, in input order.
- Returns:
Appending occurs in-place, but result is returned for compatibility.
- Return type:
self
- classmethod concat(items, ordered=True)#
Essentially an append, but with different formatting.
- head(n=5)#
Return the first n rows.
This function returns the first n rows of the dataframe. It is useful for quickly verifying data, for example, after sorting or appending rows.
- Parameters:
n (int) – Number of rows to select.
- Returns:
The first n rows of the DataFrame.
- Return type:
ak.DataFrame
See also
- tail(n=5)#
Return the last n rows.
This function returns the last n rows for the dataframe. It is useful for quickly testing if your object has the right type of data in it.
- Parameters:
n (int (default=5)) – Number of rows to select.
- Returns:
The last n rows of the DataFrame.
- Return type:
ak.DataFrame
See also
ak.dataframe.head
- sample(n=5)#
Return a random sample of n rows.
- Parameters:
n (int (default=5)) – Number of rows to return.
- Returns:
The sampled n rows of the DataFrame.
- Return type:
ak.DataFrame
- GroupBy(keys, use_series=False)#
Group the dataframe by a column or a list of columns.
- Parameters:
keys (string or list) – An (ordered) list of column names or a single string to group by.
use_series (bool) – If True, returns an ak.GroupBy object. Otherwise an arkouda GroupBy object.
- Returns:
Either an ak GroupBy or an arkouda GroupBy object.
- Return type:
See also
- memory_usage(unit='GB')#
Return the size of this DataFrame.
- Parameters:
unit (str) – Unit to return. One of {‘KB’, ‘MB’, ‘GB’}.
- Returns:
The number of bytes used by this DataFrame in [unit]s.
- Return type:
int
- to_pandas(datalimit=maxTransferBytes, retain_index=False)#
Send this DataFrame to a pandas DataFrame.
- Parameters:
datalimit (int (default=arkouda.client.maxTransferBytes)) – The maximum size, in megabytes, to transfer. The requested DataFrame will be converted to a pandas DataFrame only if the estimated size of the DataFrame does not exceed this value.
retain_index (bool (default=False)) – Normally, to_pandas() creates a new range index object. If you want to keep the index column, set this to True.
- Returns:
The result of converting this DataFrame to a pandas DataFrame.
- Return type:
pandas.DataFrame
- to_hdf(path, index=False, columns=None, file_type='distribute')#
Save DataFrame to disk as hdf5, preserving column names.
- Parameters:
path (str) – File path to save data
index (bool) – If True, save the index column. By default, do not save the index.
columns (List) – List of columns to include in the file. If None, writes out all columns
file_type (str (single | distribute)) – Default: distribute Whether to save to a single file or distribute across Locales
- Return type:
None
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.
See also
- update_hdf(prefix_path: str, index=False, columns=None, repack: bool = True)#
Overwrite the dataset with the name provided with this dataframe. If the dataset does not exist it is added
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
index (bool) – If True, save the index column. By default, do not save the index.
columns (List) – List of columns to include in the file. If None, writes out all columns
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
- to_parquet(path, index=False, columns=None, compression: str | None = None, convert_categoricals: bool = False)#
Save DataFrame to disk as parquet, preserving column names.
- Parameters:
path (str) – File path to save data
index (bool) – If True, save the index column. By default, do not save the index.
columns (List) – List of columns to include in the file. If None, writes out all columns
compression (str (Optional)) – Default None Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4
convert_categoricals (bool) – Defaults to False Parquet requires all columns to be the same size and Categoricals don’t satisfy that requirement. if set, write the equivalent Strings in place of any Categorical columns.
- Return type:
None
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.
- to_csv(path: str, index: bool = False, columns: List[str] | None = None, col_delim: str = ',', overwrite: bool = False)#
Writes DataFrame to CSV file(s). File will contain a column for each column in the DataFrame. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).
- Parameters:
path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
index (bool) – Defaults to False. If True, the index of the DataFrame will be written to the file as a column.
columns (List[str], optional) – Column names to assign when writing data.
col_delim (str) – Defaults to ",". Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
None
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations.
The column delimiter is expected to be the same for column names and data.
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline ("\n") at this time.
- classmethod read_csv(filename: str, col_delim: str = ',')#
Read the columns of a CSV file into an Arkouda DataFrame. If the file contains the appropriately formatted header, typed data will be returned. Otherwise, all data will be returned as Strings objects.
- Parameters:
filename (str) – Filename to read data from
col_delim (str) – Defaults to ",". The delimiter for columns within the data.
- Returns:
Arkouda DataFrame containing the columns from the CSV file.
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
See also
to_csv
Notes
CSV format is not currently supported by load/load_all operations.
The column delimiter is expected to be the same for column names and data.
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline ("\n") at this time.
Unlike other file formats, CSV files store Strings in their UTF-8 format instead of storing bytes as uint(8).
- save(path, index=False, columns=None, file_format='HDF5', file_type='distribute', compression: str | None = None)#
DEPRECATED Save DataFrame to disk, preserving column names.
- Parameters:
path (str) – File path to save data
index (bool) – If True, save the index column. By default, do not save the index.
columns (List) – List of columns to include in the file. If None, writes out all columns
file_format (str) – 'HDF5' or 'Parquet'. Defaults to 'HDF5'
file_type (str ("single" | "distribute")) – Defaults to "distribute". If "single", will write a single file to locale 0.
compression (str, optional) – (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4") Compression type. Only used for Parquet
Notes
This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.
See also
- classmethod load(prefix_path, file_format='INFER')#
Load a dataframe from file. The file_format parameter is needed for consistency with other load functions.
- argsort(key, ascending=True)#
Return the permutation that sorts the dataframe by key.
- Parameters:
key (str) – The key to sort on.
- Returns:
The permutation array that sorts the data on key.
- Return type:
ak.pdarray
- coargsort(keys, ascending=True)#
Return the permutation that sorts the dataframe by keys.
Sorting using Strings may not yield correct results
- Parameters:
keys (list) – The keys to sort on.
- Returns:
The permutation array that sorts the data on keys.
- Return type:
ak.pdarray
- sort_values(by=None, ascending=True)#
Sort the DataFrame by one or more columns.
If no column is specified, all columns are used.
Note: Fails on sorting ak.Strings when multiple columns being sorted
- Parameters:
by (str or list/tuple of str) – The name(s) of the column(s) to sort by.
ascending (bool) – Sort values in ascending (default) or descending order.
See also
- apply_permutation(perm)#
Apply a permutation to an entire DataFrame.
This may be useful if you want to unsort a DataFrame, or even to apply an arbitrary permutation such as the inverse of a sorting permutation.
- Parameters:
perm (ak.pdarray) – A permutation array. Should be the same size as the data arrays, and should consist of the integers [0,size-1] in some order. Very minimal testing is done to ensure this is a permutation.
See also
sort
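Applying a permutation amounts to reindexing every column by the same index array, which can be sketched in plain Python (hypothetical function name; the real method permutes server-side pdarrays in place):

```python
def apply_permutation_sketch(columns, perm):
    """Pure-Python sketch of the DataFrame.apply_permutation semantics."""
    # A valid permutation contains each of 0..size-1 exactly once.
    assert sorted(perm) == list(range(len(perm))), "perm must be a permutation"
    # Every column is reordered by the same index array, so rows stay intact.
    return {name: [col[i] for i in perm] for name, col in columns.items()}

df = {'userID': [111, 222, 333], 'day': [5, 6, 5]}
apply_permutation_sketch(df, [2, 0, 1])
# -> {'userID': [333, 111, 222], 'day': [5, 5, 6]}
```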
- filter_by_range(keys, low=1, high=None)#
Find all rows where the value count of the items in a given set of columns (keys) is within the range [low, high].
To filter by a specific value, set low == high.
- Parameters:
keys (list or str) – The names of the columns to group by
low (int (default=1)) – The lowest value count.
high (int (default=None)) – The highest value count, default to unlimited.
- Returns:
An array of boolean values for qualified rows in this DataFrame.
- Return type:
See also
filter_by_count
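The boolean mask that filter_by_range produces can be sketched in plain Python, assuming a single key column (the column values are hypothetical):

```python
from collections import Counter

# Rows qualify when the value count of their key falls within [low, high].
keys = ['a', 'b', 'a', 'c', 'a', 'b']   # stand-in for a key column
low, high = 1, 2
counts = Counter(keys)                  # value count per distinct key
mask = [low <= counts[k] <= high for k in keys]
assert mask == [False, True, False, True, False, True]   # 'a' occurs 3 times, excluded
```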
- copy(deep=True)#
Make a copy of this object’s data.
When deep = True (default), a new object will be created with a copy of the calling object’s data. Modifications to the data of the copy will not be reflected in the original object.
When deep = False a new object will be created without copying the calling object’s data. Any changes to the data of the original object will be reflected in the shallow copy, and vice versa.
- Parameters:
deep (bool (default=True)) – When True, return a deep copy. Otherwise, return a shallow copy.
- Returns:
A deep or shallow copy according to caller specification.
- Return type:
aku.DataFrame
- groupby(keys, use_series=True)#
Group the dataframe by a column or a list of columns. Alias for GroupBy
- Parameters:
keys (str or list of str) – A single column name or a list of column names to group by
use_series (bool) – If True, change the return type to an arkouda GroupBy object
- Return type:
An arkouda Groupby instance
- isin(values: arkouda.pdarrayclass.pdarray | Dict | arkouda.series.Series | DataFrame) DataFrame#
Determine whether each element in the DataFrame is contained in values.
- Parameters:
values (pdarray, dict, Series, or DataFrame) – The values to check for in DataFrame. Series can only have a single index.
- Returns:
Arkouda DataFrame of booleans showing whether each element in the DataFrame is contained in values
- Return type:
See also
ak.Series.isin
Notes
Pandas supports values being an iterable type. In arkouda, we replace this with pdarray
Pandas supports ~ operations. Currently, ak.DataFrame does not support this.
Examples
>>> df = ak.DataFrame({'col_A': ak.array([7, 3]), 'col_B': ak.array([1, 9])})
>>> df
   col_A  col_B
0      7      1
1      3      9
(2 rows x 2 columns)
When values is a pdarray, check every value in the DataFrame to determine if it exists in values.
>>> df.isin(ak.array([0, 1]))
   col_A  col_B
0  False   True
1  False  False
(2 rows x 2 columns)
When values is a dict, the values in the dict are passed to check the column indicated by the key.
>>> df.isin({'col_A': ak.array([0, 3])})
   col_A  col_B
0  False  False
1   True  False
(2 rows x 2 columns)
When values is a Series, each column is checked if values is present positionally. This means that for True to be returned, the indexes must be the same.
>>> i = ak.Index(ak.arange(2))
>>> s = ak.Series(data=[3, 9], index=i)
>>> df.isin(s)
   col_A  col_B
0  False  False
1  False   True
(2 rows x 2 columns)
When values is a DataFrame, the index and column must match. Note that 9 is not found because the column name does not match.
>>> other_df = ak.DataFrame({'col_A': ak.array([7, 3]), 'col_C': ak.array([0, 9])})
>>> df.isin(other_df)
   col_A  col_B
0   True  False
1   True  False
(2 rows x 2 columns)
- corr() DataFrame#
Return new DataFrame with pairwise correlation of columns
- Returns:
Arkouda DataFrame containing correlation matrix of all columns
- Return type:
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
See also
Notes
Generates the correlation matrix using Pearson R for all columns
Attempts to convert to numeric values where possible for inclusion in the matrix.
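The pairwise statistic behind the matrix is Pearson R; a self-contained sketch of that formula (an illustration, not arkouda's server-side implementation):

```python
import math

# Pearson R: covariance of x and y divided by the product of their
# standard deviations.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

assert abs(pearson_r([1, 2, 3], [2, 4, 6]) - 1.0) < 1e-12   # perfectly correlated
assert abs(pearson_r([1, 2, 3], [3, 2, 1]) + 1.0) < 1e-12   # perfectly anti-correlated
```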
- inner_join_merge(right: DataFrame, on: str, left_suffix: str = '_x', right_suffix: str = '_y') DataFrame#
Utilizes the ak.join.inner_join function to return an ak DataFrame object containing only rows that are in both the self and right DataFrames (based on the "on" param), as well as their associated values. For this function, self is considered the left dataframe.
- Parameters:
right (DataFrame) – The Right DataFrame to be joined
on (str) – The name of the DataFrame column the join is being performed on
left_suffix (str = "_x") – A string indicating the suffix to add to columns from self for overlapping column names in both left and right. Defaults to “_x”
right_suffix (str = "_y") – A string indicating the suffix to add to columns from the other dataframe for overlapping column names in both left and right. Defaults to “_y”
- Returns:
Inner-Joined Arkouda DataFrame
- Return type:
- right_join_merge(right: DataFrame, on: str) DataFrame#
Utilizes the ak.join.inner_join_merge function to return an ak DataFrame object containing all the rows in the right DataFrame, as well as corresponding rows in self (based on the "on" param), and all of their associated values. For this function, self is considered the left dataframe. Based on pandas merge functionality.
- merge(right: DataFrame, on: str, how: str, left_suffix: str = '_x', right_suffix: str = '_y') DataFrame#
Utilizes the ak.join.inner_join_merge and the ak.join.right_join_merge functions to return a merged Arkouda DataFrame object containing rows from both DataFrames as specified by the merge condition (based on the “how” and “on” parameters). For this function self is considered the left dataframe. Based on pandas merge functionality. https://github.com/pandas-dev/pandas/blob/main/pandas/core/reshape/merge.py#L137
- Parameters:
right (DataFrame) – The Right DataFrame to be joined
on (str) – The name of the DataFrame column the join is being performed on
how (str) – The merge condition. Must be “inner”, “left”, or “right”
left_suffix (str = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to “_x”. Only used when how is “inner”
right_suffix (str = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to “_y”. Only used when how is “inner”
- Returns:
Joined Arkouda DataFrame
- Return type:
- register(user_defined_name: str) DataFrame#
Register this DataFrame object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the DataFrame is to be registered under, this will be the root name for underlying components
- Returns:
The same DataFrame which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different DataFrames with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the DataFrame with the user_defined_name
See also
unregister, attach, unregister_dataframe_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
Any changes made to a DataFrame object after registering with the server may not be reflected in attached copies.
- unregister()#
Unregister this DataFrame object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
register, attach, unregister_dataframe_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() bool#
Return True if the object is contained in the registry
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mismatch of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- static attach(user_defined_name: str) DataFrame#
Function to return a DataFrame object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which DataFrame object was registered under
- Returns:
The DataFrame object created by re-attaching to the corresponding server components
- Return type:
- Raises:
RegistrationError – if user_defined_name is not registered
See also
- static unregister_dataframe_by_name(user_defined_name: str) None#
Function to unregister DataFrame object by name which was registered with the arkouda server via register()
- Parameters:
user_defined_name (str) – Name under which the DataFrame object was registered
- Raises:
TypeError – if user_defined_name is not a string
RegistrationError – if there is an issue attempting to unregister any underlying components
See also
- classmethod from_return_msg(rep_msg)#
- arkouda.sorted(df, column=False)#
Analogous to other python ‘sorted(obj)’ functions in that it returns a sorted copy of the DataFrame.
If no sort key is specified, sort by the first key returned.
Note: This fails on sorting ak.Strings, as does DataFrame.sort().
- Parameters:
df (ak.dataframe.DataFrame) – The DataFrame to sort.
column (str) – The name of the column to sort by.
- Returns:
A sorted copy of the original DataFrame.
- Return type:
ak.dataframe.DataFrame
- arkouda.intersect(a, b, positions=True, unique=False)#
Find the intersection of two arkouda arrays.
This function can be especially useful when positions=True so that the caller gets the indices of values present in both arrays.
- Parameters:
a (ak.Strings or ak.pdarray) – An array of strings
b (ak.Strings or ak.pdarray) – An array of strings
positions (bool (default=True)) – Return tuple of boolean pdarrays that indicate positions in a and b where the values are in the intersection.
unique (bool (default=False)) – If the number of distinct values in a (and b) is equal to the size of a (and b), there is a more efficient method to compute the intersection.
- Returns:
The indices of a and b where any element occurs at least once in both arrays.
- Return type:
(ak.pdarray, ak.pdarray)
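With positions=True the result is a pair of boolean masks. The semantics can be sketched in plain Python (the sample arrays are hypothetical):

```python
# For each array, mark the positions whose value also occurs in the other.
a = [1, 3, 5, 7]
b = [3, 7, 9]
b_vals, a_vals = set(b), set(a)
mask_a = [x in b_vals for x in a]   # positions of a in the intersection
mask_b = [x in a_vals for x in b]   # positions of b in the intersection
assert mask_a == [False, True, False, True]
assert mask_b == [True, True, False]
```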
- arkouda.invert_permutation(perm)#
Find the inverse of a permutation array.
- Parameters:
perm (ak.pdarray) – The permutation array.
- Returns:
The inverse of the permutation array.
- Return type:
ak.pdarray
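The inverse permutation q satisfies q[perm[i]] == i, so composing the two restores original order. A plain-Python sketch of the construction:

```python
# Scatter each position i into slot perm[i] to build the inverse.
perm = [2, 0, 3, 1]
inv = [0] * len(perm)
for i, p in enumerate(perm):
    inv[p] = i
assert inv == [1, 3, 0, 2]
# Composing perm with its inverse gives the identity permutation.
assert [perm[inv[i]] for i in range(len(perm))] == [0, 1, 2, 3]
```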
- arkouda.intx(a, b)#
Find all the rows that are in both dataframes. Columns should be in identical order.
Note: this does not work for columns of floating point values, but does work for Strings and int64 pdarrays; Categorical should also work.
- arkouda.inner_join_merge(left: DataFrame, right: DataFrame, on: str, left_suffix: str = '_x', right_suffix: str = '_y') DataFrame#
Utilizes the ak.join.inner_join function to return an ak DataFrame object containing only rows that are in both the left and right DataFrames (based on the "on" param), as well as their associated values.
- Parameters:
left (DataFrame) – The Left DataFrame to be joined
right (DataFrame) – The Right DataFrame to be joined
on (str) – The name of the DataFrame column the join is being performed on
left_suffix (str = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to “_x”
right_suffix (str = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to “_y”
- Returns:
Inner-Joined Arkouda DataFrame
- Return type:
- arkouda.right_join_merge(left: DataFrame, right: DataFrame, on: str, left_suffix: str = '_x', right_suffix: str = '_y') DataFrame#
Utilizes the ak.join.inner_join_merge function to return an ak DataFrame object containing all the rows in the right DataFrame, as well as corresponding rows in the left (based on the "on" param), and all of their associated values. Based on pandas merge functionality.
- Parameters:
left (DataFrame) – The Left DataFrame to be joined
right (DataFrame) – The Right DataFrame to be joined
on (str) – The name of the DataFrame column the join is being performed on
left_suffix (str = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to “_x”
right_suffix (str = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to “_y”
- Returns:
Right-Joined Arkouda DataFrame
- Return type:
- arkouda.merge(left: DataFrame, right: DataFrame, on: str, how: str, left_suffix: str = '_x', right_suffix: str = '_y') DataFrame#
Utilizes the ak.join.inner_join_merge and the ak.join.right_join_merge functions to return a merged Arkouda DataFrame object containing rows from both DataFrames as specified by the merge condition (based on the "how" and "on" parameters). Based on pandas merge functionality. https://github.com/pandas-dev/pandas/blob/main/pandas/core/reshape/merge.py#L137
- Parameters:
left (DataFrame) – The Left DataFrame to be joined
right (DataFrame) – The Right DataFrame to be joined
on (str) – The name of the DataFrame column the join is being performed on
how (str) – The merge condition. Must be “inner”, “left”, or “right”
left_suffix (str = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to “_x”. Only used when how is “inner”
right_suffix (str = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to “_y”. Only used when how is “inner”
- Returns:
Joined Arkouda DataFrame
- Return type:
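The "inner" condition keeps only keys present in both frames. A plain-Python sketch of that condition, using dicts of lists in place of DataFrames (the column names 'key', 'lval', and 'rval' are hypothetical):

```python
# Inner merge on the 'key' column: keep rows whose key occurs in both sides.
left = {'key': [1, 2, 3], 'lval': [10, 20, 30]}
right = {'key': [2, 3, 4], 'rval': [200, 300, 400]}
li = {k: i for i, k in enumerate(left['key'])}    # key -> row index, left
ri = {k: i for i, k in enumerate(right['key'])}   # key -> row index, right
common = [k for k in left['key'] if k in ri]      # the inner condition
merged = {
    'key': common,
    'lval': [left['lval'][li[k]] for k in common],
    'rval': [right['rval'][ri[k]] for k in common],
}
assert merged == {'key': [2, 3], 'lval': [20, 30], 'rval': [200, 300]}
```

A "right" merge would instead keep every key of the right frame, filling unmatched left values as missing.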
- class arkouda.Row(dict=None, /, **kwargs)#
Bases:
collections.UserDict
This class is useful for printing and working with individual rows of an aku.DataFrame.
- arkouda.akbool#
- arkouda.akfloat64#
- arkouda.akint64#
- class arkouda.GroupBy(keys: groupable | None = None, assume_sorted: bool = False, **kwargs)#
Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.
- Parameters:
keys ((list of) pdarray, Strings, or Categorical) – The array to group by value, or if list, the column arrays to group by row
assume_sorted (bool) – If True, assume keys is already sorted (Default: False)
- nkeys#
The number of key arrays (columns)
- Type:
int
- size#
The length of the input array(s), i.e. number of rows
- Type:
int
- unique_keys#
The unique values of the keys array(s), in grouped order
- Type:
(list of) pdarray, Strings, or Categorical
- ngroups#
The length of the unique_keys array(s), i.e. number of groups
- Type:
int
- logger#
Used for all logging operations
- Type:
ArkoudaLogger
- Raises:
TypeError – Raised if keys is a pdarray with a dtype other than int64
Notes
Integral pdarrays, Strings, and Categoricals are natively supported, but float64 and bool arrays are not.
For a user-defined class to be groupable, it must inherit from pdarray and define or overload the grouping API:
a ._get_grouping_keys() method that returns a list of pdarrays that can be (co)argsorted.
(Optional) a .group() method that returns the permutation that groups the array
If the input is a single array with a .group() method defined, method 2 will be used; otherwise, method 1 will be used.
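The grouping machinery itself reduces to sort-then-segment: (co)argsort the keys, and group boundaries fall wherever adjacent sorted keys differ. A plain-Python sketch of that idea (an illustration, not the server-side implementation):

```python
# Sort the keys, then find segment starts where the sorted value changes.
keys = [3, 2, 3, 1, 2]
perm = sorted(range(len(keys)), key=keys.__getitem__)   # grouping permutation
grouped = [keys[i] for i in perm]                       # keys in grouped order
segments = [0] + [i for i in range(1, len(grouped)) if grouped[i] != grouped[i - 1]]
unique_keys = [grouped[s] for s in segments]
assert unique_keys == [1, 2, 3]
assert segments == [0, 1, 3]   # group start offsets within the grouped order
```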
- Reductions#
- objType = 'GroupBy'#
- static from_return_msg(rep_msg)#
- to_hdf(prefix_path, dataset='groupby', mode='truncate', file_type='distribute')#
Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: "distribute". When set to "single", the dataset is written to a single file. When "distribute", the dataset is written to one file per locale. This is only supported by HDF5 files and has no impact on Parquet files.
- Returns:
None
GroupBy is not currently supported by Parquet
- update_hdf(prefix_path: str, dataset: str = 'groupby', repack: bool = True)#
- size() Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Count the number of elements in each group, i.e. the number of times each key appears.
- Parameters:
none –
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
counts (pdarray, int64) – The number of times each unique key appears
See also
Notes
This alias for "count" was added to conform to the Pandas API.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
- count() Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Count the number of elements in each group, i.e. the number of times each key appears.
- Parameters:
none –
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
counts (pdarray, int64) – The number of times each unique key appears
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
- aggregate(values: groupable, operator: str, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, groupable]#
Using the permutation stored in the GroupBy instance, group another array of values and apply a reduction to each group’s values.
- Parameters:
values (pdarray) – The values to group and reduce
operator (str) – The name of the reduction operator to use
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
unique_keys (groupable) – The unique keys, in grouped order
aggregates (groupable) – One aggregate value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if the requested operator is not supported for the values dtype
Examples
>>> keys = ak.arange(0, 10)
>>> vals = ak.linspace(-1, 1, 10)
>>> g = ak.GroupBy(keys)
>>> g.aggregate(vals, 'sum')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777768, -0.55555555555555536, -0.33333333333333348, -0.11111111111111116, 0.11111111111111116, 0.33333333333333348, 0.55555555555555536, 0.77777777777777768, 1]))
>>> g.aggregate(vals, 'min')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777779, -0.55555555555555558, -0.33333333333333337, -0.11111111111111116, 0.11111111111111116, 0.33333333333333326, 0.55555555555555536, 0.77777777777777768, 1]))
- sum(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and sum each group’s values.
- Parameters:
values (pdarray) – The values to group and sum
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_sums (pdarray) – One sum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The grouped sum of a boolean pdarray returns integers.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.sum(b)
(array([2, 3, 4]), array([8, 14, 6]))
- prod(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the product of each group’s values.
- Parameters:
values (pdarray) – The values to group and multiply
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_products (pdarray, float64) – One product per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if prod is not supported for the values dtype
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.prod(b)
(array([2, 3, 4]), array([12, 108.00000000000003, 8.9999999999999982]))
- var(values: arkouda.pdarrayclass.pdarray, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the variance of each group’s values.
- Parameters:
values (pdarray) – The values to group and find variance
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2).
The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.var(b)
(array([2 3 4]), array([2.333333333333333 1.2 0]))
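The effect of ddof is just a change of divisor, which can be shown with a small plain-Python illustration:

```python
# Variance with "Delta Degrees of Freedom": the sum of squared deviations
# is divided by N - ddof rather than N.
def var(x, ddof=0):
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - ddof)

x = [1.0, 2.0, 3.0, 4.0]
assert abs(var(x, ddof=0) - 1.25) < 1e-12        # maximum likelihood estimate
assert abs(var(x, ddof=1) - 5.0 / 3.0) < 1e-12   # unbiased sample variance
```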
- std(values: arkouda.pdarrayclass.pdarray, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the standard deviation of each group’s values.
- Parameters:
values (pdarray) – The values to group and find standard deviation
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)).
The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.std(b)
(array([2 3 4]), array([1.5275252316519465 1.0954451150103321 0]))
- mean(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the mean of each group’s values.
- Parameters:
values (pdarray) – The values to group and average
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_means (pdarray, float64) – One mean value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.mean(b)
(array([2, 3, 4]), array([2.6666666666666665, 2.7999999999999998, 3]))
- median(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the median of each group’s values.
- Parameters:
values (pdarray) – The values to group and find median
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_medians (pdarray, float64) – One median value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,9)
>>> a
array([4 1 4 3 2 2 2 3 3])
>>> g = ak.GroupBy(a)
>>> g.keys
array([4 1 4 3 2 2 2 3 3])
>>> b = ak.linspace(-5,5,9)
>>> b
array([-5 -3.75 -2.5 -1.25 0 1.25 2.5 3.75 5])
>>> g.median(b)
(array([1 2 3 4]), array([-3.75 1.25 3.75 -3.75]))
- min(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the minimum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find minima
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_minima (pdarray) – One minimum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if min is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if min is not supported for the values dtype
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.min(b)
(array([2, 3, 4]), array([1, 1, 3]))
- max(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the maximum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find maxima
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_maxima (pdarray) – One maximum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if max is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if max is not supported for the values dtype
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.max(b)
(array([2, 3, 4]), array([4, 4, 3]))
- argmin(values: arkouda.pdarrayclass.pdarray) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first minimum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find argmin
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if argmin is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if argmin is not supported for the values dtype
Notes
The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmin(b)
(array([2, 3, 4]), array([5, 4, 2]))
- argmax(values: arkouda.pdarrayclass.pdarray) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first maximum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find argmax
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmax(b)
(array([2, 3, 4]), array([9, 3, 2]))
- nunique(values: groupable) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the number of unique values in each group.
- Parameters:
values (pdarray, int64) – The values to group and find unique values
- Returns:
unique_keys (groupable) – The unique keys, in grouped order
group_nunique (groupable) – Number of unique values per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the dtype(s) of values array(s) does/do not support the nunique method
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if nunique is not supported for the values dtype
Examples
>>> data = ak.array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> data
array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> labels = ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> labels
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g = ak.GroupBy(labels)
>>> g.keys
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g.nunique(data)
(array([1, 2, 3, 4]), array([2, 2, 3, 1]))
# Group (1,1,1) has values [3,4,3] -> 2 unique values (3 and 4)
# Group (2,2,2) has values [1,1,4] -> 2 unique values (1 and 4)
# Group (3,3,3) has values [3,4,1] -> 3 unique values
# Group (4) has values [4] -> 1 unique value
- any(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and perform an “or” reduction on each group.
- Parameters:
values (pdarray, bool) – The values to group and reduce with “or”
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_any (pdarray, bool) – One bool per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
- all(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and perform an “and” reduction on each group.
- Parameters:
values (pdarray, bool) – The values to group and reduce with “and”
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_all (pdarray, bool) – One bool per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if all is not supported for the values dtype
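The semantics of the any and all reductions can be sketched with a local NumPy analog (illustrative only; grouped_any_all is a hypothetical helper that computes the grouping permutation and segments client-side, whereas arkouda performs these reductions on the server):

```python
import numpy as np

def grouped_any_all(keys, values):
    """Per-group "or" and "and" reductions, mimicking GroupBy.any / GroupBy.all."""
    perm = np.argsort(keys, kind="stable")      # grouping permutation
    k, v = keys[perm], values[perm]
    # segment starts: first index of each run of equal keys
    starts = np.r_[0, np.nonzero(k[1:] != k[:-1])[0] + 1]
    unique_keys = k[starts]
    group_any = np.logical_or.reduceat(v, starts)
    group_all = np.logical_and.reduceat(v, starts)
    return unique_keys, group_any, group_all

keys = np.array([1, 2, 1, 2, 3])
vals = np.array([False, True, True, False, False])
uk, ga, gal = grouped_any_all(keys, vals)
# uk -> [1 2 3]; any -> [True, True, False]; all -> [False, False, False]
```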
- OR(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise OR of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise OR reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with OR
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise OR of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if OR is not supported for the values dtype
- AND(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise AND of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise AND reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with AND
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise AND of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if AND is not supported for the values dtype
- XOR(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise XOR of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise XOR reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with XOR
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise XOR of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if XOR is not supported for the values dtype
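The bitwise OR/AND/XOR segment reductions follow the same pattern; a local NumPy sketch (grouped_bitwise is a hypothetical helper, not arkouda's server-side implementation):

```python
import numpy as np

def grouped_bitwise(keys, values):
    """Per-group bitwise OR/AND/XOR reductions, mimicking GroupBy.OR/AND/XOR."""
    perm = np.argsort(keys, kind="stable")
    k, v = keys[perm], values[perm]
    starts = np.r_[0, np.nonzero(k[1:] != k[:-1])[0] + 1]
    return (k[starts],
            np.bitwise_or.reduceat(v, starts),
            np.bitwise_and.reduceat(v, starts),
            np.bitwise_xor.reduceat(v, starts))

keys = np.array([0, 0, 1, 1])
vals = np.array([0b0011, 0b0101, 0b1111, 0b1010])
uk, ors, ands, xors = grouped_bitwise(keys, vals)
# OR -> [0b0111, 0b1111]; AND -> [0b0001, 0b1010]; XOR -> [0b0110, 0b0101]
```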
- first(values: groupable_element_type) Tuple[groupable, groupable_element_type]#
First value in each group.
- Parameters:
values (pdarray-like) – The values from which to take the first of each group
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The first value of each group
- mode(values: groupable) Tuple[groupable, groupable]#
Most common value in each group. If a group is multi-modal, return the modal value that occurs first.
- Parameters:
values ((list of) pdarray-like) – The values from which to take the mode of each group
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result ((list of) pdarray-like) – The most common value of each group
- unique(values: groupable)#
Return the set of unique values in each group, as a SegArray.
- Parameters:
values ((list of) pdarray-like) – The values to unique
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result ((list of) SegArray) – The unique values of each group
- Raises:
TypeError – Raised if values is or contains Strings or Categorical
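The mode and unique semantics can be sketched in plain Python (group_mode_unique is a hypothetical helper; arkouda computes these on the server and returns unique values as a SegArray):

```python
from collections import Counter

def group_mode_unique(keys, values):
    """Per-group mode (first-seen value wins ties, as Counter preserves
    insertion order) and per-group unique-value sets, mimicking
    GroupBy.mode and GroupBy.unique."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    unique_keys = sorted(groups)
    modes = [Counter(groups[k]).most_common(1)[0][0] for k in unique_keys]
    uniques = [sorted(set(groups[k])) for k in unique_keys]
    return unique_keys, modes, uniques

keys = [1, 1, 1, 2, 2]
vals = [3, 4, 3, 5, 5]
uk, m, u = group_mode_unique(keys, vals)
# uk -> [1, 2]; modes -> [3, 5]; uniques -> [[3, 4], [5]]
```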
- broadcast(values: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings, permute: bool = True) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings#
Fill each group’s segment with a constant value.
- Parameters:
- Returns:
The broadcasted values
- Return type:
- Raises:
TypeError – Raised if value is not a pdarray object
ValueError – Raised if the values array does not have one value per segment
Notes
This function is a sparse analog of
np.broadcast. If a GroupBy object represents a sparse matrix (tensor), then this function takes a (dense) column vector and replicates each value to the non-zero elements in the corresponding row.
Examples
>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
>>> # By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
>>> # With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5])
>>> a = ak.randint(1,5,10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
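The broadcast operation can be sketched with NumPy's repeat (groupby_broadcast is a hypothetical local helper illustrating the permute semantics, not arkouda's implementation):

```python
import numpy as np

def groupby_broadcast(keys, group_values, permute=True):
    """Replicate one value per group across that group's segment,
    mimicking GroupBy.broadcast."""
    perm = np.argsort(keys, kind="stable")
    k = keys[perm]
    starts = np.r_[0, np.nonzero(k[1:] != k[:-1])[0] + 1]
    sizes = np.diff(np.r_[starts, len(k)])       # segment lengths
    grouped = np.repeat(group_values, sizes)     # result in grouped order
    if not permute:
        return grouped
    out = np.empty_like(grouped)
    out[perm] = grouped                          # undo the grouping permutation
    return out

keys = np.array([0, 1, 0, 1, 0])
vals = np.array([3, 5])
orig = groupby_broadcast(keys, vals)                    # -> [3, 5, 3, 5, 3]
grouped = groupby_broadcast(keys, vals, permute=False)  # -> [3, 3, 3, 5, 5]
```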
- static build_from_components(user_defined_name: str = None, **kwargs) GroupBy#
Function to build a new GroupBy object from component keys and permutation.
- Parameters:
user_defined_name (str (Optional) Passing a name will init the new GroupBy) – and assign it the given name
kwargs (dict Dictionary of components required for rebuilding the GroupBy.) – Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”
- Returns:
The GroupBy object created by using the given components
- Return type:
- register(user_defined_name: str) GroupBy#
Register this GroupBy object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the GroupBy is to be registered under, this will be the root name for underlying components
- Returns:
The same GroupBy which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different GroupBys with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the GroupBy with the user_defined_name
See also
unregister, attach, unregister_groupby_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister()#
Unregister this GroupBy object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() bool#
Return True if the object is contained in the registry
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mismatch of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- static attach(user_defined_name: str) GroupBy#
Function to return a GroupBy object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which GroupBy object was registered under
- Returns:
The GroupBy object created by re-attaching to the corresponding server components
- Return type:
- Raises:
RegistrationError – if user_defined_name is not registered
See also
register, is_registered, unregister, unregister_groupby_by_name
- static unregister_groupby_by_name(user_defined_name: str) None#
Function to unregister GroupBy object by name which was registered with the arkouda server via register()
- Parameters:
user_defined_name (str) – Name under which the GroupBy object was registered
- Raises:
TypeError – if user_defined_name is not a string
RegistrationError – if there is an issue attempting to unregister any underlying components
See also
- most_common(values)#
(Deprecated) See GroupBy.mode().
- arkouda.unique(pda: groupable, return_groups: bool = False, assume_sorted: bool = False, return_indices: bool = False) groupable | Tuple[groupable, arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray, int]#
Find the unique elements of an array.
Returns the unique elements of an array, sorted if the values are integers. There is an optional output in addition to the unique elements: the number of times each unique value comes up in the input array.
- Parameters:
pda ((list of) pdarray, Strings, or Categorical) – Input array.
return_groups (bool, optional) – If True, also return grouping information for the array.
return_indices (bool, optional) – Only applicable if return_groups is True. If True, return unique key indices along with other groups
assume_sorted (bool, optional) – If True, assume pda is sorted and skip sorting step
- Returns:
unique ((list of) pdarray, Strings, or Categorical) – The unique values. If input dtype is int64, return values will be sorted.
permutation (pdarray, optional) – Permutation that groups equivalent values together (only when return_groups=True)
segments (pdarray, optional) – The offset of each group in the permuted array (only when return_groups=True)
- Raises:
TypeError – Raised if pda is not a pdarray or Strings object
RuntimeError – Raised if the pdarray or Strings dtype is unsupported
Notes
For integer arrays, this function checks to see whether pda is sorted and, if so, whether it is already unique. This step can save considerable computation. Otherwise, this function will sort pda.
Examples
>>> A = ak.array([3, 2, 1, 1, 2, 3])
>>> ak.unique(A)
array([1, 2, 3])
- arkouda.akcast(pda: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical, dt: numpy.dtype | type | str | arkouda.dtypes.BigInt, errors: ErrorMode = ErrorMode.strict) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical | Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Cast an array to another dtype.
- Parameters:
dt (np.dtype, type, or str) – The target dtype to cast values to
errors ({strict, ignore, return_validity}) –
Controls how errors are handled when casting strings to a numeric type (ignored for casts from numeric types).
strict: raise RuntimeError if any string cannot be converted
ignore: never raise an error; uninterpretable strings are converted to NaN (float64), -2**63 (int64), zero (uint64 and uint8), or False (bool)
return_validity: in addition to returning the same output as “ignore”, also return a bool array indicating where the cast was successful.
- Returns:
pdarray or Strings – Array of values cast to desired dtype
[validity (pdarray(bool)]) – If errors=”return_validity” and input is Strings, a second array is returned with True where the cast succeeded and False where it failed.
Notes
The cast is performed according to Chapel’s casting rules and is NOT safe from overflows or underflows. The user must ensure that the target dtype has the precision and capacity to hold the desired result.
Examples
>>> ak.cast(ak.linspace(1.0,5.0,5), dt=ak.int64)
array([1, 2, 3, 4, 5])
>>> ak.cast(ak.arange(0,5), dt=ak.float64).dtype
dtype('float64')
>>> ak.cast(ak.arange(0,5), dt=ak.bool)
array([False, True, True, True, True])
>>> ak.cast(ak.linspace(0,4,5), dt=ak.bool)
array([False, True, True, True, True])
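The errors modes for string-to-numeric casts can be sketched in plain Python (cast_str_to_float is a hypothetical helper illustrating the return_validity semantics, not arkouda's implementation):

```python
def cast_str_to_float(strings):
    """Mimic ak.cast(..., dt=ak.float64, errors=ErrorMode.return_validity):
    uninterpretable strings become NaN and validity is False there."""
    values, validity = [], []
    for s in strings:
        try:
            values.append(float(s))
            validity.append(True)
        except ValueError:
            values.append(float("nan"))
            validity.append(False)
    return values, validity

vals, ok = cast_str_to_float(["1.5", "oops", "3"])
# vals -> [1.5, nan, 3.0]; ok -> [True, False, True]
```

With errors=strict the failing element would instead raise, and with errors=ignore only the values list would be returned.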
- exception arkouda.RegistrationError#
Bases:
Exception
Error/Exception used when the Arkouda Server cannot register an object
- class arkouda.pdarray(name: str, mydtype: numpy.dtype | str, size: arkouda.dtypes.int_scalars, ndim: arkouda.dtypes.int_scalars, shape: Sequence[int], itemsize: arkouda.dtypes.int_scalars, max_bits: int | None = None)#
The basic arkouda array class. This class contains only the attributes of the array; the data resides on the arkouda server. When a server operation results in a new array, arkouda will create a pdarray instance that points to the array data on the server. As such, the user should not initialize pdarray instances directly.
- name#
The server-side identifier for the array
- Type:
str
- dtype#
The element type of the array
- Type:
dtype
- size#
The number of elements in the array
- Type:
int_scalars
- ndim#
The rank of the array (currently only rank 1 arrays supported)
- Type:
int_scalars
- shape#
A list or tuple containing the sizes of each dimension of the array
- Type:
Sequence[int]
- itemsize#
The size in bytes of each element
- Type:
int_scalars
- property max_bits#
- BinOps#
- OpEqOps#
- objType = 'pdarray'#
- format_other(other) str#
Attempt to cast scalar other to the element dtype of this pdarray, and print the resulting value to a string (e.g. for sending to a server command). The user should not call this function directly.
- Parameters:
other (object) – The scalar to be cast to the pdarray.dtype
- Return type:
string representation of np.dtype corresponding to the other parameter
- Raises:
TypeError – Raised if the other parameter cannot be converted to Numpy dtype
- transfer(hostname: str, port: arkouda.dtypes.int_scalars)#
Sends a pdarray to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the pdarray is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). Arkouda will use numLocales ports in succession starting at this port, i.e. the range {port..(port+numLocales-1)} (e.g., for an Arkouda server of 4 nodes, passing port 1234 will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- opeq(other, op)#
- fill(value: arkouda.dtypes.numeric_scalars) None#
Fill the array (in place) with a constant value.
- Parameters:
value (numeric_scalars) –
- Raises:
TypeError – Raised if value is not an int, int64, float, or float64
- any() numpy.bool_#
Return True iff any element of the array evaluates to True.
- all() numpy.bool_#
Return True iff all elements of the array evaluate to True.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry
- Parameters:
None –
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
Note
This will return True if the object is registered itself or as a component of another object
- info() str#
Returns a JSON formatted string containing information about all components of self
- Parameters:
None –
- Returns:
JSON string containing information about all components of self
- Return type:
str
- pretty_print_info() None#
Prints information about all components of self in a human readable format
- Parameters:
None –
- Return type:
None
- is_sorted() numpy.bool_#
Return True iff the array is monotonically non-decreasing.
- Parameters:
None –
- Returns:
Indicates if the array is monotonically non-decreasing
- Return type:
bool
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- sum() arkouda.dtypes.numeric_and_bool_scalars#
Return the sum of all elements in the array.
- prod() numpy.float64#
Return the product of all elements in the array. Return value is always a np.float64 or np.int64.
- min() arkouda.dtypes.numpy_scalars#
Return the minimum value of the array.
- max() arkouda.dtypes.numpy_scalars#
Return the maximum value of the array.
- argmin() numpy.int64 | numpy.uint64#
Return the index of the first occurrence of the array min value
- argmax() numpy.int64 | numpy.uint64#
Return the index of the first occurrence of the array max value.
- mean() numpy.float64#
Return the mean of the array.
- var(ddof: arkouda.dtypes.int_scalars = 0) numpy.float64#
Compute the variance. See arkouda.var for details.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
- Returns:
The scalar variance of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
ValueError – Raised if the ddof >= pdarray size
RuntimeError – Raised if there’s a server-side error thrown
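The ddof parameter follows the familiar NumPy "Delta Degrees of Freedom" convention, dividing by (n - ddof); a quick NumPy illustration:

```python
import numpy as np

# ddof=0 gives the population variance, ddof=1 the sample variance.
x = np.array([1.0, 2.0, 3.0, 4.0])
pop = x.var(ddof=0)    # sum((x - mean)**2) / n        -> 1.25
samp = x.var(ddof=1)   # sum((x - mean)**2) / (n - 1)  -> 1.666...
```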
- std(ddof: arkouda.dtypes.int_scalars = 0) numpy.float64#
Compute the standard deviation. See arkouda.std for details.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
The scalar standard deviation of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- cov(y: pdarray) numpy.float64#
Compute the covariance between self and y.
- Parameters:
y (pdarray) – Other pdarray used to calculate covariance
- Returns:
The scalar covariance of the two arrays
- Return type:
np.float64
- Raises:
TypeError – Raised if y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- corr(y: pdarray) numpy.float64#
Compute the correlation between self and y using pearson correlation coefficient.
- Parameters:
y (pdarray) – Other pdarray used to calculate correlation
- Returns:
The scalar correlation of the two arrays
- Return type:
np.float64
- Raises:
TypeError – Raised if y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- mink(k: arkouda.dtypes.int_scalars) pdarray#
Compute the minimum “k” values.
- Parameters:
k (int_scalars) – The desired count of minimum values to be returned by the output.
- Returns:
The minimum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- maxk(k: arkouda.dtypes.int_scalars) pdarray#
Compute the maximum “k” values.
- Parameters:
k (int_scalars) – The desired count of maximum values to be returned by the output.
- Returns:
The maximum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- argmink(k: arkouda.dtypes.int_scalars) pdarray#
Finds the indices corresponding to the minimum “k” values.
- Parameters:
k (int_scalars) – The desired count of minimum values to be returned by the output.
- Returns:
Indices corresponding to the minimum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- argmaxk(k: arkouda.dtypes.int_scalars) pdarray#
Finds the indices corresponding to the maximum “k” values.
- Parameters:
k (int_scalars) – The desired count of maximum values to be returned by the output.
- Returns:
Indices corresponding to the maximum k values, sorted
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
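The mink/maxk/argmink/argmaxk semantics can be sketched locally with NumPy (illustrative only; tie-breaking between equal values may differ from arkouda's server-side selection):

```python
import numpy as np

a = np.array([10, 5, 1, 3, 7, 9])
k = 3

mink = np.sort(np.partition(a, k - 1)[:k])          # smallest k values
maxk = np.sort(np.partition(a, len(a) - k)[-k:])    # largest k values
argmink = np.argsort(a, kind="stable")[:k]          # indices of the smallest k
argmaxk = np.argsort(a, kind="stable")[-k:]         # indices of the largest k
# mink -> [1, 3, 5]; maxk -> [7, 9, 10]
```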
- value_counts()#
Count the occurrences of the unique values of self.
- Returns:
unique_values (pdarray) – The unique values, sorted in ascending order
counts (pdarray, int64) – The number of times the corresponding unique value occurs
Examples
>>> ak.array([2, 0, 2, 4, 0, 0]).value_counts()
(array([0, 2, 4]), array([3, 2, 1]))
- astype(dtype) pdarray#
Cast values of pdarray to provided dtype
- Parameters:
dtype (np.dtype or str) – Dtype to cast to
- Returns:
An arkouda pdarray with values converted to the specified data type
- Return type:
ak.pdarray
Notes
This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.
- slice_bits(low, high) pdarray#
Returns a pdarray containing only bits from low to high of self.
This is zero indexed and inclusive on both ends, so slicing the bottom 64 bits is pda.slice_bits(0, 63)
- Parameters:
low (int) – The lowest bit included in the slice (inclusive) zero indexed, so the first bit is 0
high (int) – The highest bit included in the slice (inclusive)
- Returns:
A new pdarray containing the bits of self from low to high
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> p = ak.array([2**65 + (2**64 - 1)])
>>> bin(p[0])
'0b101111111111111111111111111111111111111111111111111111111111111111'
>>> bin(p.slice_bits(64, 65)[0])
'0b10'
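The bit-slice semantics (inclusive on both ends, zero-indexed) can be sketched in plain Python for a single integer (slice_bits here is a hypothetical local helper, not the server-side implementation):

```python
def slice_bits(x, low, high):
    """Keep bits low..high (inclusive, zero-indexed) of integer x,
    shifted down so the slice starts at bit 0."""
    width = high - low + 1
    return (x >> low) & ((1 << width) - 1)

p = 2**65 + (2**64 - 1)    # binary: '10' followed by 64 ones
slice_bits(p, 64, 65)      # -> 0b10 == 2
slice_bits(p, 0, 63)       # -> 2**64 - 1 (the bottom 64 bits)
```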
- bigint_to_uint_arrays() List[pdarray]#
Creates a list of uint pdarrays from a bigint pdarray. The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.
- Returns:
A list of uint pdarrays, where the first item holds the highest 64 bits of the bigint pdarray and the last item holds the lowest 64 bits.
- Return type:
List[pdarrays]
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> a = ak.arange(2**64, 2**64 + 5)
>>> a
array(["18446744073709551616" "18446744073709551617" "18446744073709551618" "18446744073709551619" "18446744073709551620"])
>>> a.bigint_to_uint_arrays()
[array([1 1 1 1 1]), array([0 1 2 3 4])]
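The per-element decomposition can be sketched in plain Python (to_uint64_limbs is a hypothetical helper showing how one bigint value maps to 64-bit limbs, most significant first):

```python
def to_uint64_limbs(x):
    """Split a non-negative Python int into 64-bit limbs,
    highest 64 bits first, mirroring bigint_to_uint_arrays per element."""
    limbs = []
    while True:
        limbs.append(x & (2**64 - 1))  # take the lowest 64 bits
        x >>= 64
        if x == 0:
            break
    return limbs[::-1]                 # most significant limb first

to_uint64_limbs(2**64 + 3)   # -> [1, 3]
```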
- reshape(*shape, order='row_major')#
Gives a new shape to an array without changing its data.
- Parameters:
shape (int, tuple of ints, or pdarray) – The new shape should be compatible with the original shape.
order (str {'row_major' | 'C' | 'column_major' | 'F'}) – Read the elements of the pdarray in this index order. By default, read the elements in 'row_major' or C-like order, where the last index changes the fastest. If 'column_major' or 'F', read the elements in column-major or Fortran-like order, where the first index changes the fastest.
- Returns:
An arrayview object with the data from the array but with the new shape
- Return type:
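The 'row_major'/'C' and 'column_major'/'F' read orders correspond to NumPy's reshape order parameter; a quick NumPy illustration:

```python
import numpy as np

a = np.arange(6)
row = a.reshape(2, 3, order="C")   # row-major: last index changes fastest
col = a.reshape(2, 3, order="F")   # column-major: first index changes fastest
# row -> [[0, 1, 2], [3, 4, 5]]
# col -> [[0, 2, 4], [1, 3, 5]]
```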
- to_ndarray() numpy.ndarray#
Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same attributes and data as the pdarray
- Return type:
np.ndarray
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed
client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
numpy.ndarray
- to_list() List#
Convert the array to a list, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A list with the same data as the pdarray
- Return type:
list
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed
client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_list()
[0, 1, 2, 3, 4]
>>> type(a.to_list())
list
- to_cuda()#
Convert the array to a Numba DeviceNDArray, transferring array data from the arkouda server to Python via ndarray. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A Numba ndarray with the same attributes and data as the pdarray; on GPU
- Return type:
numba.DeviceNDArray
- Raises:
ImportError – Raised if CUDA is not available
ModuleNotFoundError – Raised if Numba is either not installed or not enabled
RuntimeError – Raised if there is a server-side error thrown in the course of retrieving the pdarray.
Notes
The number of bytes in the array cannot exceed
client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_cuda()
array([0, 1, 2, 3, 4])
>>> type(a.to_cuda())
numba.DeviceNDArray
- to_parquet(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None) str#
Save the pdarray to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If 'append', attempt to create new dataset in existing files.
compression (str (Optional)) – (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4") Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales.
'append' write mode is supported, but is not efficient.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (parquet)
>>> a.to_parquet('path/prefix.parquet', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
- to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute') str#
Save the pdarray to HDF5. The object can be saved to a collection of files or a single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If 'append', attempt to create new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: "distribute". When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'. Otherwise, the file name will be prefix_path.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')
Saves the array to a single HDF5 file on the root node: ``cwd/path/name_prefix.hdf5``
- update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)#
Overwrite the dataset with the name provided with this pdarray. If the dataset does not exist, it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added.
- to_csv(prefix_path: str, dataset: str = 'array', col_delim: str = ',', overwrite: bool = False)#
Write pdarray to CSV file(s). The file will contain a single column with the pdarray data. All CSV files written by Arkouda include a header denoting the data types of the columns.
- prefix_path: str
The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
- dataset: str
Column name to save the pdarray under. Defaults to “array”.
- col_delim: str
Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
- overwrite: bool
Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
str response message
- ValueError
Raised if all datasets are not present in all csv files or if one or more of the specified files do not exist
- RuntimeError
Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
- TypeError
Raised if we receive an unknown arkouda_type returned from the server
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (``\n``) at this time.
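The single-column layout described above can be sketched in plain Python. Only the single-column shape and the presence of a dtype header come from the text; the exact header syntax Arkouda emits is not shown here, so the header row below is a hypothetical stand-in.

```python
import csv
import io

def write_single_column_csv(buf, name, dtype, values, col_delim=","):
    # Hypothetical sketch: column name row, a header row denoting the
    # column's dtype, then one value per row. Not Arkouda's exact format.
    writer = csv.writer(buf, delimiter=col_delim, lineterminator="\n")
    writer.writerow([name])   # column name
    writer.writerow([dtype])  # dtype header
    for v in values:
        writer.writerow([v])

buf = io.StringIO()
write_single_column_csv(buf, "array", "int64", [0, 1, 2])
print(buf.getvalue())
```

Note how the column delimiter would corrupt the file if it also appeared inside a value, which is why the docstring warns against that.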
- save(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str#
DEPRECATED
Save the pdarray to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 supports single files, in which case the file name will be only the one provided. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create a new dataset in existing files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.
file_type (str ("single" | "distribute")) – Default: “distribute”. When set to “single”, the dataset is written to a single file. When “distribute”, the dataset is written to one file per locale. This is only supported for HDF5 files and has no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append
TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string
See also
save_all, load, read, to_parquet, to_hdf
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission. Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales, or a dataset with the same name already exists, a RuntimeError will result. Previously all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found. Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.save('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.save('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####.h5``, where #### is replaced by each locale number
>>> # Saving with an extension (Parquet)
>>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####.parquet``, where #### is replaced by each locale number
- register(user_defined_name: str) pdarray#
Register this pdarray with a user defined name in the arkouda server so it can be attached to later using pdarray.attach(). This is an in-place operation; registering a pdarray more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one pdarray at a time.
- Parameters:
user_defined_name (str) – user defined name array is to be registered under
- Returns:
The same pdarray which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different pdarrays with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the pdarray with the user_defined_name If the user is attempting to register more than one pdarray with the same name, the former should be unregistered first to free up the registration name.
See also
attach, unregister, is_registered, list_registry, unregister_pdarray_by_name
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
- unregister() None#
Unregister a pdarray in the arkouda server which was previously registered using register() and/or attached to using attach()
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not find the internal name/symbol to remove
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
- static attach(user_defined_name: str) pdarray#
Class method that returns a pdarray attached to the registered name in the arkouda server, which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which array was registered under
- Returns:
pdarray which is bound to the corresponding server side component which was registered with user_defined_name
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
- arkouda.arange(*args, **kwargs) arkouda.pdarrayclass.pdarray#
arange([start,] stop[, stride,] dtype=int64)
Create a pdarray of consecutive integers within the interval [start, stop). If only one arg is given then arg is the stop parameter. If two args are given, then the first arg is start and second is stop. If three args are given, then the first arg is start, second is stop, third is stride.
The return value is cast to type dtype
- Parameters:
start (int_scalars, optional) – Starting value (inclusive)
stop (int_scalars) – Stopping value (exclusive)
stride (int_scalars, optional) – The difference between consecutive elements; the default stride is 1. If stride is specified, then start must also be specified.
dtype (np.dtype, type, or str) – The target dtype to cast values to
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
Integers from start (inclusive) to stop (exclusive) by stride
- Return type:
pdarray, dtype
- Raises:
TypeError – Raised if start, stop, or stride is not an int object
ZeroDivisionError – Raised if stride == 0
Notes
Negative strides result in decreasing values. Currently, only int64 pdarrays can be created with this method. For float64 arrays, use the linspace method.
Examples
>>> ak.arange(0, 5, 1)
array([0, 1, 2, 3, 4])
>>> ak.arange(5, 0, -1)
array([5, 4, 3, 2, 1])
>>> ak.arange(0, 10, 2)
array([0, 2, 4, 6, 8])
>>> ak.arange(-5, -10, -1)
array([-5, -6, -7, -8, -9])
- arkouda.array(a: arkouda.pdarrayclass.pdarray | numpy.ndarray | Iterable, dtype: numpy.dtype | type | str = None, max_bits: int = -1) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings#
Convert a Python or Numpy Iterable to a pdarray or Strings object, sending the corresponding data to the arkouda server.
- Parameters:
a (Union[pdarray, np.ndarray]) – Rank-1 array of a supported dtype
dtype (np.dtype, type, or str) – The target dtype to cast values to
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
A pdarray instance stored on arkouda server or Strings instance, which is composed of two pdarrays stored on arkouda server
- Return type:
- Raises:
TypeError – Raised if a is not a pdarray, np.ndarray, or Python Iterable such as a list, array, tuple, or deque
RuntimeError – Raised if a is not one-dimensional, nbytes > maxTransferBytes, a.dtype is not supported (not in DTypes), or if the product of a size and a.itemsize > maxTransferBytes
ValueError – Raised if the returned message is malformed or does not contain the fields required to generate the array.
See also
Notes
The number of bytes in the input array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overwhelming the connection between the Python client and the arkouda server, under the assumption that it is a low-bandwidth connection. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but should proceed with caution.
If the pdarray or ndarray is of type U, this method is called twice recursively to create the Strings object and the two corresponding pdarrays for string bytes and offsets, respectively.
Examples
>>> ak.array(np.arange(1,10))
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> ak.array(range(1,10))
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> strings = ak.array([f'string {i}' for i in range(0,5)])
>>> type(strings)
<class 'arkouda.strings.Strings'>
- arkouda.create_pdarray(repMsg: str, max_bits=None) pdarray#
Return a pdarray instance pointing to an array created by the arkouda server. The user should not call this function directly.
- Parameters:
repMsg (str) – space-delimited string containing the pdarray name, datatype, size, dimension, shape, and itemsize
- Returns:
A pdarray with the attributes and data of the server-side array described by repMsg
- Return type:
- Raises:
ValueError – If there’s an error in parsing the repMsg parameter into the six values needed to create the pdarray instance
RuntimeError – Raised if a server-side error is thrown in the process of creating the pdarray instance
- arkouda.ones(size: arkouda.dtypes.int_scalars | str, dtype: numpy.dtype | type | str | arkouda.dtypes.BigInt = float64, max_bits: int | None = None) arkouda.pdarrayclass.pdarray#
Create a pdarray filled with ones.
- Parameters:
size (int_scalars) – Size of the array (only rank-1 arrays supported)
dtype (Union[float64, int64, bool]) – Resulting array type, default float64
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
Ones of the requested size and dtype
- Return type:
- Raises:
TypeError – Raised if the supplied dtype is not supported or if the size parameter is neither an int nor a str that is parseable to an int.
Examples
>>> ak.ones(5, dtype=ak.int64)
array([1, 1, 1, 1, 1])
>>> ak.ones(5, dtype=ak.float64)
array([1, 1, 1, 1, 1])
>>> ak.ones(5, dtype=ak.bool)
array([True, True, True, True, True])
- arkouda.in1d(pda1: arkouda.groupbyclass.groupable, pda2: arkouda.groupbyclass.groupable, assume_unique: bool = False, symmetric: bool = False, invert: bool = False) arkouda.pdarrayclass.pdarray | arkouda.groupbyclass.groupable#
Test whether each element of a 1-D array is also present in a second array.
Returns a boolean array the same length as pda1 that is True where an element of pda1 is in pda2 and False otherwise.
Supports multi-level arrays – tests membership of rows of a in the set of rows of b.
- Parameters:
a (list of pdarrays, pdarray, Strings, or Categorical) – Rows are elements for which to test membership in b
b (list of pdarrays, pdarray, Strings, or Categorical) – Rows are elements of the set in which to test membership
assume_unique (bool) – If true, assume rows of a and b are each unique and sorted. By default, sort and unique them explicitly.
symmetric (bool) – If True, return both in1d(pda1, pda2) and in1d(pda2, pda1); only valid when pda1 and pda2 are single items.
invert (bool, optional) – If True, the values in the returned array are inverted (that is, False where an element of pda1 is in pda2 and True otherwise). Default is False.
ak.in1d(a, b, invert=True) is equivalent to (but is faster than) ~ak.in1d(a, b).
- Returns:
True for each row in a that is contained in b
- Return type:
pdarray, bool
Notes
Only works for pdarrays of int64 dtype, Strings, or Categorical
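As a mental model, the boolean-mask semantics described above (including the invert flag) mirror NumPy's isin; a NumPy sketch for intuition only, not Arkouda code:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([2, 4, 6])

# True where a's element appears in b
mask = np.isin(a, b)
# invert=True computes the negation directly, equivalent to ~mask
inv = np.isin(a, b, invert=True)

print(mask)  # [False  True False  True False]
```

The Arkouda version performs the same elementwise membership test server-side on distributed arrays.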
- arkouda.convert_if_categorical(values)#
Convert a Categorical array to Strings for display
- arkouda.generic_concat(items, ordered=True)#
- arkouda.get_callback(x)#
- class arkouda.Index(values: List | arkouda.pdarrayclass.pdarray | arkouda.Strings | arkouda.Categorical | pandas.Index | Index, name: str | None = None)#
- property index#
This is maintained to support older code
- property shape#
- property is_unique#
Property indicating if all values in the index are unique
- Return type:
bool - True if all values are unique, False otherwise
- objType = 'Index'#
- static factory(index)#
- classmethod from_return_msg(rep_msg)#
- to_pandas()#
- to_ndarray()#
- to_list()#
- set_dtype(dtype)#
Change the data type of the index
Currently only aku.ip_address and ak.array are supported.
- register(user_defined_name)#
Register this Index object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the Index is to be registered under, this will be the root name for underlying components
- Returns:
The same Index which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Indexes with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Index with the user_defined_name
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister()#
Unregister this Index object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered()#
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
numpy.bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mis-match of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- to_dict(label)#
- argsort(ascending=True)#
- concat(other)#
- lookup(key)#
- to_hdf(prefix_path: str, dataset: str = 'index', mode: str = 'truncate', file_type: str = 'distribute') str#
Save the Index to HDF5. The object can be saved to a collection of files or a single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create a new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”. When set to “single”, the dataset is written to a single file. When “distribute”, the dataset is written to one file per locale. This is only supported for HDF5 files and has no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path.
- If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales, or a dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- update_hdf(prefix_path: str, dataset: str = 'index', repack: bool = True)#
Overwrite the dataset with the name provided with this Index object. If the dataset does not exist it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the index
Notes
If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added.
Because HDF5 deletes do not release memory, this will create a copy of the file with the new data
- to_parquet(prefix_path: str, dataset: str = 'index', mode: str = 'truncate', compression: str | None = None)#
Save the Index to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create a new dataset in existing files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’.
- ‘append’ write mode is supported, but is not efficient.
- If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales, or a dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- to_csv(prefix_path: str, dataset: str = 'index', col_delim: str = ',', overwrite: bool = False)#
Write Index to CSV file(s). The file will contain a single column with the pdarray data. All CSV files written by Arkouda include a header denoting the data types of the columns.
- prefix_path: str
The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
- dataset: str
Column name to save the pdarray under. Defaults to “array”.
- col_delim: str
Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
- overwrite: bool
Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
str response message
- ValueError
Raised if all datasets are not present in all csv files or if one or more of the specified files do not exist
- RuntimeError
Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
- TypeError
Raised if we receive an unknown arkouda_type returned from the server
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (``\n``) at this time.
- save(prefix_path: str, dataset: str = 'index', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str#
DEPRECATED
Save the index to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create a new dataset in existing files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.
file_type (str ("single" | "distribute")) – Default: “distribute”. When set to “single”, the dataset is written to a single file. When “distribute”, the dataset is written to one file per locale. This is only supported for HDF5 files and has no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append
TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string
See also
save_all, load, read, to_parquet, to_hdf
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission. Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales, or a dataset with the same name already exists, a RuntimeError will result. Previously all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found. Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- class arkouda.MultiIndex(values)#
Bases: Index
- property index#
This is maintained to support older code
- objType = 'MultiIndex'#
- to_pandas()#
- set_dtype(dtype)#
Change the data type of the index
Currently only aku.ip_address and ak.array are supported.
- to_ndarray()#
- to_list()#
- register(user_defined_name)#
Register this Index object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the Index is to be registered under, this will be the root name for underlying components
- Returns:
The same Index which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Indexes with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Index with the user_defined_name
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister()#
Unregister this Index object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered()#
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
numpy.bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mis-match of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- to_dict(labels)#
- argsort(ascending=True)#
- concat(other)#
- lookup(key)#
- to_hdf(prefix_path: str, dataset: str = 'index', mode: str = 'truncate', file_type: str = 'distribute') str#
Save the Index to HDF5. The object can be saved to a collection of files or a single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create a new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”. When set to “single”, the dataset is written to a single file. When “distribute”, the dataset is written to one file per locale. This is only supported for HDF5 files and has no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path.
- If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales, or a dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- update_hdf(prefix_path: str, dataset: str = 'index', repack: bool = True)#
Overwrite the dataset with the name provided with this Index object. If the dataset does not exist it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the index
Notes
If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added.
Because HDF5 deletes do not release memory, this will create a copy of the file with the new data
- class arkouda.Series(data: Tuple | List | arkouda.groupbyclass.groupable_element_type, index: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Tuple | List | arkouda.index.Index | None = None)#
One-dimensional arkouda array with axis labels.
- Parameters:
- Raises:
TypeError – Raised if index is not a pdarray or Strings object, or if data is not a pdarray, Strings, or Categorical object
ValueError – Raised if the index size does not match data size
Notes
The Series class accepts either positional arguments or keyword arguments. If entering positional arguments,
- 2 arguments entered:
argument 1 - data argument 2 - index
- 1 argument entered:
argument 1 - data
If entering 1 positional argument, it is assumed that this is the data argument. If only ‘data’ argument is passed in, Index will automatically be generated. If entering keywords,
‘data’ (see Parameters) ‘index’ (optional) must match size of ‘data’
- property shape#
- objType = 'Series'#
- dt#
- str_acc#
- isin(lst: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | List) Series#
Find Series elements whose values are in the specified list
- Parameters:
lst (Union[pdarray, Strings, List]) – Either a python list or an arkouda array
- Return type:
A Series of Arkouda booleans that is True for elements that are in the list and False otherwise
- locate(key: int | arkouda.pdarrayclass.pdarray | arkouda.index.Index | Series | List | Tuple) Series#
Lookup values by index label
The input can be a scalar, a list of scalars, or a list of lists (if the series has a MultiIndex). As a special case, if a Series is used as the key, the series labels are preserved and its values are used as the key.
Keys will be turned into arkouda arrays as needed.
- Return type:
A Series containing the values corresponding to the key.
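A plain-Python analogy for label-based lookup (a hypothetical helper, not the Arkouda API): values are fetched by index label, and a list of labels returns the corresponding values in order.

```python
# Hypothetical sketch of label-based lookup on a labeled sequence.
labels = [10, 20, 30]
values = ["a", "b", "c"]
by_label = dict(zip(labels, values))

def locate(keys):
    # Scalar keys are promoted to a one-element list, mirroring the
    # docstring's "scalar or list of scalars" behavior.
    keys = keys if isinstance(keys, list) else [keys]
    return [by_label[k] for k in keys]

print(locate(20))        # ['b']
print(locate([30, 10]))  # ['c', 'a']
```

In Arkouda, the same lookup runs server-side and the keys are converted to arkouda arrays as needed.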
- topn(n: int = 10) Series#
Return the top values of the series
- Parameters:
n (Number of values to return) –
- Return type:
A new Series with the top values
- sort_index(ascending: bool = True) Series#
Sort the series by its index
- Return type:
A new Series sorted.
- sort_values(ascending: bool = True) Series#
Sort the series numerically
- Return type:
A new Series sorted smallest to largest
- to_pandas() pandas.Series#
Convert the series to a local PANDAS series
- to_list() list#
- value_counts(sort: bool = True) Series#
Return a Series containing counts of unique values.
The resulting object will be in descending order so that the first element is the most frequently-occurring element.
- Parameters:
sort (Boolean. Whether or not to sort the results. Default is true.) –
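The semantics can be sketched in pure Python using the standard library (illustrative helper only; arkouda computes the counts server-side):

```python
from collections import Counter

# Sketch of value_counts semantics: unique values paired with their
# frequencies, most frequent first when sort=True.
def value_counts(values, sort=True):
    counts = Counter(values)
    return counts.most_common() if sort else list(counts.items())

value_counts([2, 2, 3, 2, 3, 1])
# [(2, 3), (3, 2), (1, 1)]
```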
- diff() Series#
Diffs consecutive values of the series.
Returns a new series with the same index and length. First value is set to NaN.
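A minimal sketch of the diff semantics (illustrative helper, not the server-side implementation):

```python
import math

# Sketch of diff semantics: element-wise difference with the previous
# element; the first position has no predecessor and becomes NaN.
def diff(values):
    out = [math.nan]
    out.extend(b - a for a, b in zip(values, values[1:]))
    return out

d = diff([1.0, 4.0, 9.0])
# d == [nan, 3.0, 5.0]
```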
- to_dataframe(index_labels: List[str] = None, value_label: str = None) arkouda.dataframe.DataFrame#
Converts series to an arkouda data frame
- Parameters:
index_labels (column names(s) to label the index.) –
value_label (column name to label values.) –
- Return type:
An arkouda dataframe.
- register(user_defined_name: str)#
Register this Series object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the Series is to be registered under, this will be the root name for underlying components
- Returns:
The same Series which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Series with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Series with the user_defined_name
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister()#
Unregister this Series object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- static attach(label: str, nkeys: int = 1) Series#
DEPRECATED Retrieve a series registered with arkouda
- Parameters:
label (name used to register the series) –
nkeys (number of keys, if a multi-index was registered) –
- is_registered() bool#
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
numpy.bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mis-match of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- classmethod from_return_msg(repMsg: str) Series#
Return a Series instance pointing to components created by the arkouda server. The user should not call this function directly.
- Parameters:
repMsg (str) –
delimited string containing the values and indexes
- Returns:
A Series representing a set of pdarray components on the server
- Return type:
- Raises:
RuntimeError – Raised if a server-side error is thrown in the process of creating the Series instance
- static concat(arrays: List, axis: int = 0, index_labels: List[str] = None, value_labels: List[str] = None) arkouda.dataframe.DataFrame | Series#
Concatenate in arkouda a list of arkouda Series or grouped arkouda arrays horizontally or vertically. If a list of grouped arkouda arrays is passed, they are converted to a series. Each grouping is a 2-tuple with the first item being the key(s) and the second being the value. If horizontal, each series or grouping must have the same length and the same index. The index of the series is converted to a column in the dataframe. If it is a multi-index, each level is converted to a column.
- Parameters:
arrays (The list of series/groupings to concat.) –
axis (Whether to do a vertical (axis=0) or horizontal (axis=1) concatenation) –
index_labels (column names(s) to label the index.) –
value_labels (column names to label values of each series.) –
- Returns:
axis=0 (an arkouda series.)
axis=1 (an arkouda dataframe.)
- static pdconcat(arrays: List, axis: int = 0, labels: arkouda.strings.Strings = None) pandas.Series | pandas.DataFrame#
Concatenate a list of arkouda Series or grouped arkouda arrays, returning a PANDAS object.
If a list of grouped arkouda arrays is passed they are converted to a series. Each grouping is a 2-tuple with the first item being the key(s) and the second being the value.
If horizontal, each series or grouping must have the same length and the same index. The index of the series is converted to a column in the dataframe. If it is a multi-index, each level is converted to a column.
- Parameters:
arrays (The list of series/groupings to concat.) –
axis (Whether to do a vertical (axis=0) or horizontal (axis=1) concatenation) –
labels (names to give the columns of the data frame.) –
- Returns:
axis=0 (a local PANDAS series)
axis=1 (a local PANDAS dataframe)
- class arkouda.Categorical(values, **kwargs)#
Represents an array of values belonging to named categories. Converting a Strings object to Categorical often saves memory and speeds up operations, especially if there are many repeated values, at the cost of some one-time work in initialization.
- Parameters:
values (Strings) – String values to convert to categories
NAvalue (str scalar) – The value to use to represent missing/null data
- permutation#
The permutation that groups the values in the same order as categories
- Type:
pdarray, int64
- size#
The number of items in the array
- Type:
Union[int,np.int64]
- nlevels#
The number of distinct categories
- Type:
Union[int,np.int64]
- ndim#
The rank of the array (currently only rank 1 arrays supported)
- Type:
Union[int,np.int64]
- shape#
The sizes of each dimension of the array
- Type:
tuple
- BinOps#
- RegisterablePieces#
- RequiredPieces#
- permutation#
- segments#
- objType = 'Categorical'#
- dtype#
- classmethod from_codes(codes: arkouda.pdarrayclass.pdarray, categories: arkouda.strings.Strings, permutation=None, segments=None, **kwargs) Categorical#
Make a Categorical from codes and categories arrays. If codes and categories have already been pre-computed, this constructor saves time. If not, please use the normal constructor.
- Parameters:
- Returns:
The Categorical object created from the input parameters
- Return type:
- Raises:
TypeError – Raised if codes is not a pdarray of int64 objects or if categories is not a Strings object
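The codes/categories decomposition that from_codes consumes can be sketched in pure Python (illustrative `factorize` helper only; the real constructor takes arkouda pdarray/Strings objects):

```python
# Sketch of the codes/categories layout: each element of the original
# array is stored as an integer code into a small array of unique labels.
def factorize(values):
    categories = sorted(set(values))
    lookup = {c: i for i, c in enumerate(categories)}
    codes = [lookup[v] for v in values]
    return codes, categories

codes, categories = factorize(["lo", "hi", "lo", "mid"])
# codes == [1, 0, 1, 2], categories == ["hi", "lo", "mid"]
```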
- classmethod from_return_msg(rep_msg) Categorical#
Create categorical from return message from server
Notes
This is currently only used when reading a Categorical from HDF5 files.
- classmethod standardize_categories(arrays, NAvalue='N/A')#
Standardize an array of Categoricals so that they share the same categories.
- Parameters:
arrays (sequence of Categoricals) – The Categoricals to standardize
NAvalue (str scalar) – The value to use to represent missing/null data
- Returns:
A list of the original Categoricals remapped to the shared categories.
- Return type:
List of Categoricals
- set_categories(new_categories, NAvalue=None)#
Set categories to user-defined values.
- Parameters:
new_categories (Strings) – The array of new categories to use. Must be unique.
NAvalue (str scalar) – The value to use to represent missing/null data
- Returns:
A new Categorical with the user-defined categories. Old values present in new categories will appear unchanged. Old values not present will be assigned the NA value.
- Return type:
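The remapping rule can be sketched in pure Python (illustrative helper; the default NA value "N/A" mirrors the class default, and the real method works on arkouda Strings):

```python
# Sketch of set_categories semantics: values present in the new category
# list are kept; anything else maps to the NA value.
def set_categories(values, new_categories, na_value="N/A"):
    allowed = set(new_categories)
    return [v if v in allowed else na_value for v in values]

set_categories(["red", "blue", "green"], ["red", "green"])
# ["red", "N/A", "green"]
```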
- to_ndarray() numpy.ndarray#
Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. This conversion discards category information and produces an ndarray of strings. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A numpy ndarray of strings corresponding to the values in this array
- Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
- to_list() List#
Convert the Categorical to a list, transferring data from the arkouda server to Python. This conversion discards category information and produces a list of strings. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A list of strings corresponding to the values in this Categorical
- Return type:
list
Notes
The number of bytes in the Categorical cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
- isna()#
Find where values are missing or null (as defined by self.NAvalue)
- reset_categories() Categorical#
Recompute the category labels, discarding any unused labels. This method is often useful after slicing or indexing a Categorical array, when the resulting array only contains a subset of the original categories. In this case, eliminating unused categories can speed up other operations.
- Returns:
A Categorical object generated from the current instance
- Return type:
- contains(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element contains the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that contain substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
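The reason for the speedup can be sketched in pure Python (illustrative helper only): the predicate runs once per unique category, and the full-length answer is then a cheap per-element code lookup.

```python
# Sketch of why Categorical.contains is fast: evaluate the substring test
# on the small categories array, then gather by code.
def categorical_contains(codes, categories, substr):
    per_category = [substr in c for c in categories]   # one test per category
    return [per_category[code] for code in codes]      # full-length result

cats = ["apple", "banana", "cherry"]
codes = [0, 1, 0, 2, 1]
categorical_contains(codes, cats, "an")
# [False, True, False, False, True]
```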
- startswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element starts with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that start with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
- endswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element ends with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that end with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
- in1d(test: arkouda.strings.Strings | Categorical) arkouda.pdarrayclass.pdarray#
Test whether each element of the Categorical object is also present in the test Strings or Categorical object.
Returns a boolean array the same length as self that is True where an element of self is in test and False otherwise.
- Parameters:
test (Union[Strings,Categorical]) – The values against which to test each value of self.
- Returns:
The values self[in1d] are in the test Strings or Categorical object.
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if test is not a Strings or Categorical object
See also
Notes
in1d can be considered as an element-wise function version of the python keyword in, for 1-D sequences.
in1d(a, b) is logically equivalent to ak.array([item in b for item in a]), but is much faster and scales to arbitrarily large a.
Examples
>>> strings = ak.array([f'String {i}' for i in range(0,5)])
>>> cat = ak.Categorical(strings)
>>> ak.in1d(cat,strings)
array([True, True, True, True, True])
>>> strings = ak.array([f'String {i}' for i in range(5,9)])
>>> catTwo = ak.Categorical(strings)
>>> ak.in1d(cat,catTwo)
array([False, False, False, False, False])
- unique() Categorical#
- hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Compute a 128-bit hash of each element of the Categorical.
- Returns:
A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.
- Return type:
Notes
The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.
- group() arkouda.pdarrayclass.pdarray#
Return the permutation that groups the array, placing equivalent categories together. All instances of the same category are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.
- Returns:
The permutation that groups the array by value
- Return type:
Notes
This method is faster than the corresponding Strings method. If the Categorical was created from a Strings object, then this function simply returns the cached permutation. Even if the Categorical was created using from_codes(), this function will be faster than Strings.group() because it sorts dense integer values, rather than 128-bit hash values.
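The "sort dense integer codes" idea can be sketched in pure Python (illustrative helper; the real method returns a cached or server-computed arkouda permutation):

```python
# Sketch of group() semantics: a permutation that places equal codes in
# contiguous blocks. Sorting small integer codes is what makes the
# Categorical version faster than hashing full strings.
def group_permutation(codes):
    return sorted(range(len(codes)), key=lambda i: codes[i])

codes = [2, 0, 2, 1, 0]
perm = group_permutation(codes)
# perm == [1, 4, 3, 0, 2]; gathering codes by perm gives [0, 0, 1, 2, 2]
```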
- argsort()#
- sort()#
- concatenate(others: Sequence[Categorical], ordered: bool = True) Categorical#
Merge this Categorical with other Categorical objects in the array, concatenating the arrays and synchronizing the categories.
- Parameters:
others (Sequence[Categorical]) – The Categorical arrays to concatenate and merge with this one
ordered (bool) – If True (default), the arrays will be appended in the order given. If False, array data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.
- Returns:
The merged Categorical object
- Return type:
- Raises:
TypeError – Raised if any others array objects are not Categorical objects
Notes
This operation can be expensive – slower than concatenating Strings.
- to_hdf(prefix_path, dataset='categorical_array', mode='truncate', file_type='distribute')#
Save the Categorical to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: "distribute". When set to single, dataset is written to a single file. When distribute, dataset is written to a file per locale.
- Return type:
None
See also
- update_hdf(prefix_path, dataset='categorical_array', repack=True)#
Overwrite the dataset with the name provided with this Categorical object. If the dataset does not exist it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
None
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the Categorical
Notes
If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
Because HDF5 deletes do not release memory, the repack option allows for automatic creation of a file without the inaccessible data.
- to_parquet(prefix_path: str, dataset: str = 'categorical_array', mode: str = 'truncate', compression: str | None = None) str#
This functionality is currently not supported and will raise a RuntimeError. Support is in development. Save the Categorical to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in HDF5 files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Categorical dataset within existing files.
compression (str (Optional)) – Default None Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4
- Return type:
String message indicating result of save operation
- Raises:
RuntimeError – Raised on any call, due to compatibility issues of Categorical with Parquet.
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’.
‘append’ write mode is supported, but is not efficient.
If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
See also
- save(prefix_path: str, dataset: str = 'categorical_array', file_format: str = 'HDF5', mode: str = 'truncate', file_type: str = 'distribute', compression: str | None = None) str#
DEPRECATED Save the Categorical object to HDF5 or Parquet. The result is a collection of HDF5/Parquet files, one file per locale of the arkouda server, where each filename starts with prefix_path and dataset. Each locale saves its chunk of the Strings array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in HDF5 files (must not already exist) :type dataset: str :param file_format: The format to save the file to. :type file_format: str {‘HDF5 | ‘Parquet’} :param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, create a new Categorical dataset within existing files.
- Parameters:
file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.
compression (str (Optional)) – {None | ‘snappy’ | ‘gzip’ | ‘brotli’ | ‘zstd’ | ‘lz4’} The compression type to use when writing. This is only supported for Parquet files and will not be used with HDF5.
- Return type:
String message indicating result of save operation
- Raises:
ValueError – Raised if the lengths of columns and values differ, or the mode is neither ‘truncate’ nor ‘append’
TypeError – Raised if prefix_path, dataset, or mode is not a str
Notes
Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter.
See also
- register(user_defined_name: str) Categorical#
Register this Categorical object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the Categorical is to be registered under, this will be the root name for underlying components
- Returns:
The same Categorical which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Categoricals with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Categorical with the user_defined_name
See also
unregister, attach, unregister_categorical_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister() None#
Unregister this Categorical object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
register, attach, unregister_categorical_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
numpy.bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mis-match of registered components
See also
register, attach, unregister, unregister_categorical_by_name
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- info() str#
Returns a JSON formatted string containing information about all components of self
- Parameters:
None –
- Returns:
JSON string containing information about all components of self
- Return type:
str
- pretty_print_info() None#
Prints information about all components of self in a human readable format
- Parameters:
None –
- Return type:
None
- static attach(user_defined_name: str) Categorical#
DEPRECATED Function to return a Categorical object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which Categorical object was registered under
- Returns:
The Categorical object created by re-attaching to the corresponding server components
- Return type:
- Raises:
TypeError – if user_defined_name is not a string
- static unregister_categorical_by_name(user_defined_name: str) None#
Function to unregister Categorical object by name which was registered with the arkouda server via register()
- Parameters:
user_defined_name (str) – Name under which the Categorical object was registered
- Raises:
TypeError – if user_defined_name is not a string
RegistrationError – if there is an issue attempting to unregister any underlying components
See also
- static parse_hdf_categoricals(d: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings]) Tuple[List[str], Dict[str, Categorical]]#
This function should be used in conjunction with the load_all function which reads hdf5 files and reconstitutes Categorical objects. Categorical objects use a naming convention and HDF5 structure so they can be identified and constructed for the user.
In general you should not call this method directly
- Parameters:
d (Dictionary of String to either Pdarray or Strings object) –
- Returns:
2-Tuple of List of strings containing key names which should be removed and Dictionary of
base name to Categorical object
See also
- transfer(hostname: str, port: arkouda.dtypes.int_scalars)#
Sends a Categorical object to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the Categorical is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open numLocales ports in succession, so ports in the range {port..(port+numLocales)} will be used (e.g., when running an Arkouda server on 4 nodes and port 1234 is passed, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- arkouda.bigint#
- arkouda.akfloat64#
- arkouda.akint64#
- arkouda.akuint64#
- class arkouda.GroupBy(keys: groupable | None = None, assume_sorted: bool = False, **kwargs)#
Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.
- Parameters:
keys ((list of) pdarray, Strings, or Categorical) – The array to group by value, or if list, the column arrays to group by row
assume_sorted (bool) – If True, assume keys is already sorted (Default: False)
- nkeys#
The number of key arrays (columns)
- Type:
int
- size#
The length of the input array(s), i.e. number of rows
- Type:
int
- unique_keys#
The unique values of the keys array(s), in grouped order
- Type:
(list of) pdarray, Strings, or Categorical
- ngroups#
The length of the unique_keys array(s), i.e. number of groups
- Type:
int
- logger#
Used for all logging operations
- Type:
ArkoudaLogger
- Raises:
TypeError – Raised if keys is a pdarray with a dtype other than int64
Notes
Integral pdarrays, Strings, and Categoricals are natively supported, but float64 and bool arrays are not.
For a user-defined class to be groupable, it must inherit from pdarray and define or overload the grouping API:
a ._get_grouping_keys() method that returns a list of pdarrays that can be (co)argsorted.
(Optional) a .group() method that returns the permutation that groups the array
If the input is a single array with a .group() method defined, method 2 will be used; otherwise, method 1 will be used.
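The quantities GroupBy derives from its keys can be sketched in pure Python (illustrative `build_groupby` helper only; the real class computes these server-side on arkouda arrays):

```python
# Sketch of GroupBy construction: a grouping permutation, the unique keys
# in grouped order, and one segment start offset per group.
def build_groupby(keys):
    perm = sorted(range(len(keys)), key=lambda i: keys[i])
    grouped = [keys[i] for i in perm]
    unique_keys, segments = [], []
    for pos, k in enumerate(grouped):
        if not unique_keys or k != unique_keys[-1]:
            unique_keys.append(k)   # first time this key is seen
            segments.append(pos)    # start of this key's block
    return perm, unique_keys, segments

perm, uk, seg = build_groupby([3, 1, 3, 2, 1])
# uk == [1, 2, 3], seg == [0, 2, 3]
```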
- Reductions#
- objType = 'GroupBy'#
- static from_return_msg(rep_msg)#
- to_hdf(prefix_path, dataset='groupby', mode='truncate', file_type='distribute')#
Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: "distribute". When set to single, dataset is written to a single file. When distribute, dataset is written to a file per locale. This is only supported by HDF5 files and has no impact on Parquet files.
- Returns:
None
GroupBy is not currently supported by Parquet
- update_hdf(prefix_path: str, dataset: str = 'groupby', repack: bool = True)#
- size() Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Count the number of elements in each group, i.e. the number of times each key appears.
- Parameters:
none –
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
counts (pdarray, int64) – The number of times each unique key appears
See also
Notes
This alias for “count” was added to conform with the Pandas API
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
- count() Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Count the number of elements in each group, i.e. the number of times each key appears.
- Parameters:
none –
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
counts (pdarray, int64) – The number of times each unique key appears
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
- aggregate(values: groupable, operator: str, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, groupable]#
Using the permutation stored in the GroupBy instance, group another array of values and apply a reduction to each group’s values.
- Parameters:
values (pdarray) – The values to group and reduce
operator (str) – The name of the reduction operator to use
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
unique_keys (groupable) – The unique keys, in grouped order
aggregates (groupable) – One aggregate value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if the requested operator is not supported for the values dtype
Examples
>>> keys = ak.arange(0, 10)
>>> vals = ak.linspace(-1, 1, 10)
>>> g = ak.GroupBy(keys)
>>> g.aggregate(vals, 'sum')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777768, -0.55555555555555536, -0.33333333333333348, -0.11111111111111116, 0.11111111111111116, 0.33333333333333348, 0.55555555555555536, 0.77777777777777768, 1]))
>>> g.aggregate(vals, 'min')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777779, -0.55555555555555558, -0.33333333333333337, -0.11111111111111116, 0.11111111111111116, 0.33333333333333326, 0.55555555555555536, 0.77777777777777768, 1]))
- sum(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and sum each group’s values.
- Parameters:
values (pdarray) – The values to group and sum
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_sums (pdarray) – One sum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The grouped sum of a boolean pdarray returns integers.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.sum(b)
(array([2, 3, 4]), array([8, 14, 6]))
- prod(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the product of each group’s values.
- Parameters:
values (pdarray) – The values to group and multiply
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_products (pdarray, float64) – One product per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if prod is not supported for the values dtype
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.prod(b)
(array([2, 3, 4]), array([12, 108.00000000000003, 8.9999999999999982]))
- var(values: arkouda.pdarrayclass.pdarray, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the variance of each group’s values.
- Parameters:
values (pdarray) – The values to group and find variance
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2).
The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.var(b)
(array([2 3 4]), array([2.333333333333333 1.2 0]))
- std(values: arkouda.pdarrayclass.pdarray, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the standard deviation of each group’s values.
- Parameters:
values (pdarray) – The values to group and find standard deviation
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)).
The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.std(b)
(array([2 3 4]), array([1.5275252316519465 1.0954451150103321 0]))
- mean(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the mean of each group’s values.
- Parameters:
values (pdarray) – The values to group and average
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_means (pdarray, float64) – One mean value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.mean(b)
(array([2, 3, 4]), array([2.6666666666666665, 2.7999999999999998, 3]))
- median(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the median of each group’s values.
- Parameters:
values (pdarray) – The values to group and find median
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_medians (pdarray, float64) – One median value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,9)
>>> a
array([4 1 4 3 2 2 2 3 3])
>>> g = ak.GroupBy(a)
>>> g.keys
array([4 1 4 3 2 2 2 3 3])
>>> b = ak.linspace(-5,5,9)
>>> b
array([-5 -3.75 -2.5 -1.25 0 1.25 2.5 3.75 5])
>>> g.median(b)
(array([1 2 3 4]), array([-3.75 1.25 3.75 -3.75]))
- min(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the minimum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find minima
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_minima (pdarray) – One minimum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if min is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if min is not supported for the values dtype
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.min(b)
(array([2, 3, 4]), array([1, 1, 3]))
- max(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the maximum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find maxima
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_maxima (pdarray) – One maximum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if max is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if max is not supported for the values dtype
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.max(b)
(array([2, 3, 4]), array([4, 4, 3]))
- argmin(values: arkouda.pdarrayclass.pdarray) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first minimum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find argmin
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if argmin is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if argmin is not supported for the values dtype
Notes
The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmin(b)
(array([2, 3, 4]), array([5, 4, 2]))
- argmax(values: arkouda.pdarrayclass.pdarray) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first maximum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find argmax
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmax(b)
(array([2, 3, 4]), array([9, 3, 2]))
- nunique(values: groupable) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the number of unique values in each group.
- Parameters:
values (pdarray, int64) – The values to group and find unique values
- Returns:
unique_keys (groupable) – The unique keys, in grouped order
group_nunique (groupable) – Number of unique values per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the dtype(s) of values array(s) does/do not support the nunique method
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if nunique is not supported for the values dtype
Examples
>>> data = ak.array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> data
array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> labels = ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> labels
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g = ak.GroupBy(labels)
>>> g.keys
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g.nunique(data)
(array([1, 2, 3, 4]), array([2, 2, 3, 1]))
# Group 1 has values [3, 4, 3] -> 2 unique values (3 and 4)
# Group 2 has values [1, 1, 4] -> 2 unique values (1 and 4)
# Group 3 has values [3, 4, 1] -> 3 unique values
# Group 4 has values [4]       -> 1 unique value
- any(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and perform an “or” reduction on each group.
- Parameters:
values (pdarray, bool) – The values to group and reduce with “or”
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_any (pdarray, bool) – One bool per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
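For readers without a running Arkouda server, the per-group "or" reduction that any performs can be sketched in pure Python; the keys and flags below are illustrative data, not taken from the Arkouda docs:

```python
from collections import defaultdict

def groupby_any(keys, values):
    """Per-group logical OR, mirroring the semantics of GroupBy.any."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    unique_keys = sorted(groups)          # unique keys, in grouped order
    return unique_keys, [any(groups[k]) for k in unique_keys]

keys = [2, 3, 3, 2, 4]
flags = [False, True, False, False, False]
print(groupby_any(keys, flags))  # ([2, 3, 4], [False, True, False])
```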
- all(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and perform an “and” reduction on each group.
- Parameters:
values (pdarray, bool) – The values to group and reduce with “and”
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_all (pdarray, bool) – One bool per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if all is not supported for the values dtype
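Symmetrically, the per-group "and" reduction performed by all can be sketched in pure Python on illustrative data:

```python
from collections import defaultdict

def groupby_all(keys, values):
    """Per-group logical AND, mirroring the semantics of GroupBy.all."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    unique_keys = sorted(groups)          # unique keys, in grouped order
    return unique_keys, [all(groups[k]) for k in unique_keys]

keys = [1, 1, 2, 2]
flags = [True, False, True, True]
print(groupby_all(keys, flags))  # ([1, 2], [False, True])
```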
- OR(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise OR of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise OR reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with OR
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise OR of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if OR is not supported for the values dtype
- AND(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise AND of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise AND reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with AND
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise AND of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if AND is not supported for the values dtype
- XOR(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise XOR of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise XOR reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with XOR
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise XOR of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if XOR is not supported for the values dtype
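The three bitwise segment reductions (OR, AND, XOR) share the same shape: group, then fold each segment with the corresponding bitwise operator. A pure-Python sketch on illustrative data:

```python
from collections import defaultdict
from functools import reduce
import operator

def groupby_bitwise(keys, values, op):
    """Per-group bitwise reduction; op is operator.or_, operator.and_, or operator.xor."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    unique_keys = sorted(groups)          # unique keys, in grouped order
    return unique_keys, [reduce(op, groups[k]) for k in unique_keys]

keys = [0, 1, 0, 1]
vals = [0b101, 0b011, 0b110, 0b011]
print(groupby_bitwise(keys, vals, operator.or_))   # ([0, 1], [7, 3])
print(groupby_bitwise(keys, vals, operator.and_))  # ([0, 1], [4, 3])
print(groupby_bitwise(keys, vals, operator.xor))   # ([0, 1], [3, 0])
```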
- first(values: groupable_element_type) Tuple[groupable, groupable_element_type]#
First value in each group.
- Parameters:
values (pdarray-like) – The values from which to take the first of each group
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The first value of each group
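One common reading of first (taking the first value seen for each key in original order) can be sketched in pure Python; note this is an illustrative assumption about ordering, since the actual result follows the GroupBy instance's grouped order:

```python
def groupby_first(keys, values):
    """First value encountered for each key, an illustrative model of GroupBy.first."""
    seen = {}
    for k, v in zip(keys, values):
        seen.setdefault(k, v)             # keep only the first value per key
    unique_keys = sorted(seen)            # unique keys, in grouped order
    return unique_keys, [seen[k] for k in unique_keys]

print(groupby_first([2, 1, 2, 1], ['a', 'b', 'c', 'd']))  # ([1, 2], ['b', 'a'])
```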
- mode(values: groupable) Tuple[groupable, groupable]#
Most common value in each group. If a group is multi-modal, return the modal value that occurs first.
- Parameters:
values ((list of) pdarray-like) – The values from which to take the mode of each group
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result ((list of) pdarray-like) – The most common value of each group
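The tie-breaking rule documented above (a multi-modal group returns the modal value that occurs first) can be modeled in pure Python on illustrative data:

```python
from collections import defaultdict, Counter

def groupby_mode(keys, values):
    """Most common value per group; ties go to the value that occurs first."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    unique_keys = sorted(groups)          # unique keys, in grouped order
    modes = []
    for k in unique_keys:
        counts = Counter(groups[k])
        best = max(counts.values())
        # pick the first-occurring value among those with the maximal count
        modes.append(next(v for v in groups[k] if counts[v] == best))
    return unique_keys, modes

print(groupby_mode([1, 1, 1, 2, 2], [5, 7, 5, 9, 8]))  # ([1, 2], [5, 9])
```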
- unique(values: groupable)#
Return the set of unique values in each group, as a SegArray.
- Parameters:
values ((list of) pdarray-like) – The values to unique
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result ((list of) SegArray) – The unique values of each group
- Raises:
TypeError – Raised if values is or contains Strings or Categorical
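A pure-Python model of the per-group unique sets (the real method returns them as a SegArray; the within-group ordering here, first appearance, is an illustrative assumption):

```python
from collections import defaultdict

def groupby_unique(keys, values):
    """Unique values per group, in order of first appearance within the group."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        if v not in groups[k]:            # keep each value once per group
            groups[k].append(v)
    unique_keys = sorted(groups)          # unique keys, in grouped order
    return unique_keys, [groups[k] for k in unique_keys]

print(groupby_unique([1, 1, 2, 2, 2], [3, 3, 4, 5, 4]))  # ([1, 2], [[3], [4, 5]])
```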
- broadcast(values: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings, permute: bool = True) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings#
Fill each group’s segment with a constant value.
- Parameters:
values (pdarray, Strings) – The values to broadcast; must contain one value per segment
permute (bool) – If True (the default), the result is returned in the original key order; if False, it is returned in grouped order
- Returns:
The broadcast values
- Return type:
pdarray or Strings
- Raises:
TypeError – Raised if value is not a pdarray object
ValueError – Raised if the values array does not have one value per segment
Notes
This function is a sparse analog of np.broadcast. If a GroupBy object represents a sparse matrix (tensor), then this function takes a (dense) column vector and replicates each value to the non-zero elements in the corresponding row.
Examples
>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
# By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
# With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5])
>>> a = ak.randint(1,5,10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
- static build_from_components(user_defined_name: str = None, **kwargs) GroupBy#
Function to build a new GroupBy object from component keys and permutation.
- Parameters:
user_defined_name (str, optional) – Passing a name will init the new GroupBy and assign it the given name
kwargs (dict) – Dictionary of components required for rebuilding the GroupBy. Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”
- Returns:
The GroupBy object created by using the given components
- Return type:
- register(user_defined_name: str) GroupBy#
Register this GroupBy object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the GroupBy is to be registered under, this will be the root name for underlying components
- Returns:
The same GroupBy which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different GroupBys with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the GroupBy with the user_defined_name
See also
unregister, attach, unregister_groupby_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister()#
Unregister this GroupBy object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() bool#
Return True if the object is contained in the registry
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mismatch of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- static attach(user_defined_name: str) GroupBy#
Function to return a GroupBy object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which GroupBy object was registered under
- Returns:
The GroupBy object created by re-attaching to the corresponding server components
- Return type:
- Raises:
RegistrationError – if user_defined_name is not registered
See also
register,is_registered,unregister,unregister_groupby_by_name
- static unregister_groupby_by_name(user_defined_name: str) None#
Function to unregister GroupBy object by name which was registered with the arkouda server via register()
- Parameters:
user_defined_name (str) – Name under which the GroupBy object was registered
- Raises:
TypeError – if user_defined_name is not a string
RegistrationError – if there is an issue attempting to unregister any underlying components
See also
- most_common(values)#
(Deprecated) See GroupBy.mode().
- arkouda.broadcast(segments: arkouda.pdarrayclass.pdarray, values: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings, size: int | numpy.int64 | numpy.uint64 = -1, permutation: arkouda.pdarrayclass.pdarray | None = None)#
Broadcast a dense column vector to the rows of a sparse matrix or grouped array.
- Parameters:
segments (pdarray, int64) – Offsets of the start of each row in the sparse matrix or grouped array. Must be sorted in ascending order.
values (pdarray, Strings) – The values to broadcast, one per row (or group)
size (int) – The total number of nonzeros in the matrix. If permutation is given, this argument is ignored and the size is inferred from the permutation array.
permutation (pdarray, int64) – The permutation to go from the original ordering of nonzeros to the ordering grouped by row. To broadcast values back to the original ordering, this permutation will be inverted. If no permutation is supplied, it is assumed that the original nonzeros were already grouped by row. In this case, the size argument must be given.
- Returns:
The broadcast values, one per nonzero
- Return type:
- Raises:
ValueError –
If segments and values are different sizes
If segments are empty
If number of nonzeros (either user-specified or inferred from permutation) is less than one
Examples
>>> # Define a sparse matrix with 3 rows and 7 nonzeros
>>> row_starts = ak.array([0, 2, 5])
>>> nnz = 7
# Broadcast the row number to each nonzero element
>>> row_number = ak.arange(3)
>>> ak.broadcast(row_starts, row_number, nnz)
array([0 0 1 1 1 2 2])
# If the original nonzeros were in reverse order...
>>> permutation = ak.arange(6, -1, -1)
>>> ak.broadcast(row_starts, row_number, permutation=permutation)
array([2 2 1 1 1 0 0])
- arkouda.unique(pda: groupable, return_groups: bool = False, assume_sorted: bool = False, return_indices: bool = False) groupable | Tuple[groupable, arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray, int]#
Find the unique elements of an array.
Returns the unique elements of an array, sorted if the values are integers. There is an optional output in addition to the unique elements: the number of times each unique value comes up in the input array.
- Parameters:
pda ((list of) pdarray, Strings, or Categorical) – Input array.
return_groups (bool, optional) – If True, also return grouping information for the array.
return_indices (bool, optional) – Only applicable if return_groups is True. If True, return unique key indices along with other groups
assume_sorted (bool, optional) – If True, assume pda is sorted and skip sorting step
- Returns:
unique ((list of) pdarray, Strings, or Categorical) – The unique values. If input dtype is int64, return values will be sorted.
permutation (pdarray, optional) – Permutation that groups equivalent values together (only when return_groups=True)
segments (pdarray, optional) – The offset of each group in the permuted array (only when return_groups=True)
- Raises:
TypeError – Raised if pda is not a pdarray or Strings object
RuntimeError – Raised if the pdarray or Strings dtype is unsupported
Notes
For integer arrays, this function checks to see whether pda is sorted and, if so, whether it is already unique. This step can save considerable computation. Otherwise, this function will sort pda.
Examples
>>> A = ak.array([3, 2, 1, 1, 2, 3])
>>> ak.unique(A)
array([1, 2, 3])
- arkouda.where(condition: arkouda.pdarrayclass.pdarray, A: str | arkouda.dtypes.numeric_scalars | arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical, B: str | arkouda.dtypes.numeric_scalars | arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical#
Returns an array with elements chosen from A and B based upon a conditioning array. As is the case with numpy.where, the return array consists of values from the first array (A) where the conditioning array elements are True and from the second array (B) where the conditioning array elements are False.
- Parameters:
condition (pdarray) – Used to choose values from A or B
A (Union[numeric_scalars, str, pdarray, Strings, Categorical]) – Value(s) used when condition is True
B (Union[numeric_scalars, str, pdarray, Strings, Categorical]) – Value(s) used when condition is False
- Returns:
Values chosen from A where the condition is True and B where the condition is False
- Return type:
- Raises:
TypeError – Raised if the condition object is not a pdarray; if A or B is not an int, np.int64, float, np.float64, str, pdarray, Strings, or Categorical; if pdarray dtypes are not supported or do not match; or if multiple condition clauses (see Notes section) are applied
ValueError – Raised if the shapes of the condition, A, and B pdarrays are unequal
Examples
>>> a1 = ak.arange(1,10)
>>> a2 = ak.ones(9, dtype=np.int64)
>>> cond = a1 < 5
>>> ak.where(cond,a1,a2)
array([1, 2, 3, 4, 1, 1, 1, 1, 1])

>>> a1 = ak.arange(1,10)
>>> a2 = ak.ones(9, dtype=np.int64)
>>> cond = a1 == 5
>>> ak.where(cond,a1,a2)
array([1, 1, 1, 1, 5, 1, 1, 1, 1])

>>> a1 = ak.arange(1,10)
>>> a2 = 10
>>> cond = a1 < 5
>>> ak.where(cond,a1,a2)
array([1, 2, 3, 4, 10, 10, 10, 10, 10])

>>> s1 = ak.array([f'str {i}' for i in range(10)])
>>> s2 = 'str 21'
>>> cond = (ak.arange(10) % 2 == 0)
>>> ak.where(cond,s1,s2)
array(['str 0', 'str 21', 'str 2', 'str 21', 'str 4', 'str 21', 'str 6', 'str 21', 'str 8', 'str 21'])

>>> c1 = ak.Categorical(ak.array([f'str {i}' for i in range(10)]))
>>> c2 = ak.Categorical(ak.array([f'str {i}' for i in range(9, -1, -1)]))
>>> cond = (ak.arange(10) % 2 == 0)
>>> ak.where(cond,c1,c2)
array(['str 0', 'str 8', 'str 2', 'str 6', 'str 4', 'str 4', 'str 6', 'str 2', 'str 8', 'str 0'])
Notes
A and B must have the same dtype. Only a single conditional clause is supported; compound conditions (e.g., combining n < 5 and n > 1), which numpy supports, are not currently supported in Arkouda
- class arkouda.pdarray(name: str, mydtype: numpy.dtype | str, size: arkouda.dtypes.int_scalars, ndim: arkouda.dtypes.int_scalars, shape: Sequence[int], itemsize: arkouda.dtypes.int_scalars, max_bits: int | None = None)#
The basic arkouda array class. This class contains only the attributes of the array; the data resides on the arkouda server. When a server operation results in a new array, arkouda will create a pdarray instance that points to the array data on the server. As such, the user should not initialize pdarray instances directly.
- name#
The server-side identifier for the array
- Type:
str
- dtype#
The element type of the array
- Type:
dtype
- size#
The number of elements in the array
- Type:
int_scalars
- ndim#
The rank of the array (currently only rank 1 arrays supported)
- Type:
int_scalars
- shape#
A list or tuple containing the sizes of each dimension of the array
- Type:
Sequence[int]
- itemsize#
The size in bytes of each element
- Type:
int_scalars
- property max_bits#
- BinOps#
- OpEqOps#
- objType = 'pdarray'#
- format_other(other) str#
Attempt to cast scalar other to the element dtype of this pdarray, and format the resulting value as a string (e.g. for sending to a server command). The user should not call this function directly.
- Parameters:
other (object) – The scalar to be cast to the pdarray.dtype
- Return type:
string representation of np.dtype corresponding to the other parameter
- Raises:
TypeError – Raised if the other parameter cannot be converted to Numpy dtype
- transfer(hostname: str, port: arkouda.dtypes.int_scalars)#
Sends a pdarray to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the pdarray is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). Sending the array will use numLocales ports in succession, i.e. the range {port..(port+numLocales-1)} (e.g., for an Arkouda server of 4 nodes with 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- opeq(other, op)#
- fill(value: arkouda.dtypes.numeric_scalars) None#
Fill the array (in place) with a constant value.
- Parameters:
value (numeric_scalars) –
- Raises:
TypeError – Raised if value is not an int, int64, float, or float64
- any() numpy.bool_#
Return True iff any element of the array evaluates to True.
- all() numpy.bool_#
Return True iff all elements of the array evaluate to True.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry
- Parameters:
None –
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
Note
This will return True if the object is registered itself or as a component of another object
- info() str#
Returns a JSON formatted string containing information about all components of self
- Parameters:
None –
- Returns:
JSON string containing information about all components of self
- Return type:
str
- pretty_print_info() None#
Prints information about all components of self in a human readable format
- Parameters:
None –
- Return type:
None
- is_sorted() numpy.bool_#
Return True iff the array is monotonically non-decreasing.
- Parameters:
None –
- Returns:
Indicates if the array is monotonically non-decreasing
- Return type:
bool
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- sum() arkouda.dtypes.numeric_and_bool_scalars#
Return the sum of all elements in the array.
- prod() numpy.float64#
Return the product of all elements in the array. Return value is always a np.float64 or np.int64.
- min() arkouda.dtypes.numpy_scalars#
Return the minimum value of the array.
- max() arkouda.dtypes.numpy_scalars#
Return the maximum value of the array.
- argmin() numpy.int64 | numpy.uint64#
Return the index of the first occurrence of the array min value
- argmax() numpy.int64 | numpy.uint64#
Return the index of the first occurrence of the array max value.
- mean() numpy.float64#
Return the mean of the array.
- var(ddof: arkouda.dtypes.int_scalars = 0) numpy.float64#
Compute the variance. See arkouda.var for details.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
- Returns:
The scalar variance of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
ValueError – Raised if the ddof >= pdarray size
RuntimeError – Raised if there’s a server-side error thrown
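The ddof parameter behaves like NumPy's: the sum of squared deviations is divided by N - ddof. A minimal NumPy sketch of the same computation (NumPy arrays stand in for server-side pdarrays here):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])

# ddof=0 (the default above): population variance, divides by N.
print(np.var(a, ddof=0))   # 1.25

# ddof=1: sample variance, divides by N - 1.
print(np.var(a, ddof=1))   # ~1.6667
```

Raising ddof to the array size or beyond is what triggers the ValueError noted above.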
- std(ddof: arkouda.dtypes.int_scalars = 0) numpy.float64#
Compute the standard deviation. See arkouda.std for details.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
The scalar standard deviation of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- cov(y: pdarray) numpy.float64#
Compute the covariance between self and y.
- Parameters:
y (pdarray) – Other pdarray used to calculate covariance
- Returns:
The scalar covariance of the two arrays
- Return type:
np.float64
- Raises:
TypeError – Raised if y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- corr(y: pdarray) numpy.float64#
Compute the correlation between self and y using the Pearson correlation coefficient.
- Parameters:
y (pdarray) – Other pdarray used to calculate correlation
- Returns:
The scalar correlation of the two arrays
- Return type:
np.float64
- Raises:
TypeError – Raised if y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
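Both statistics have NumPy counterparts; the hedged sketch below illustrates the semantics only (NumPy stands in for the Arkouda server, and note that np.cov defaults to ddof=1, so Arkouda's normalization may differ):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Covariance of x with y: the off-diagonal entry of the 2x2 covariance matrix.
print(np.cov(x, y)[0, 1])       # ~3.3333 (sample covariance, ddof=1)

# Pearson correlation coefficient; perfectly linear data gives 1.0.
print(np.corrcoef(x, y)[0, 1])  # 1.0
```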
- mink(k: arkouda.dtypes.int_scalars) pdarray#
Compute the minimum “k” values.
- Parameters:
k (int_scalars) – The desired count of minimum values to be returned by the output.
- Returns:
The minimum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- maxk(k: arkouda.dtypes.int_scalars) pdarray#
Compute the maximum “k” values.
- Parameters:
k (int_scalars) – The desired count of maximum values to be returned by the output.
- Returns:
The maximum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- argmink(k: arkouda.dtypes.int_scalars) pdarray#
Finds the indices corresponding to the minimum “k” values.
- Parameters:
k (int_scalars) – The desired count of minimum values to be returned by the output.
- Returns:
Indices corresponding to the minimum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- argmaxk(k: arkouda.dtypes.int_scalars) pdarray#
Finds the indices corresponding to the maximum “k” values.
- Parameters:
k (int_scalars) – The desired count of maximum values to be returned by the output.
- Returns:
Indices corresponding to the maximum k values, sorted
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
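The four top-k methods above share one pattern: select the k smallest or largest values, or the indices of those values. Arkouda computes these server-side; this client-side NumPy sketch only illustrates the semantics:

```python
import numpy as np

a = np.array([7, 2, 9, 2, 5])
k = 2

order = np.argsort(a, kind="stable")  # indices in ascending value order

print(a[order[:k]])    # mink:    [2 2]
print(a[order[-k:]])   # maxk:    [7 9]
print(order[:k])       # argmink: [1 3]
print(order[-k:])      # argmaxk: [0 2]
```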
- value_counts()#
Count the occurrences of the unique values of self.
- Returns:
unique_values (pdarray) – The unique values, sorted in ascending order
counts (pdarray, int64) – The number of times the corresponding unique value occurs
Examples
>>> ak.array([2, 0, 2, 4, 0, 0]).value_counts()
(array([0, 2, 4]), array([3, 2, 1]))
- astype(dtype) pdarray#
Cast values of pdarray to provided dtype
- Parameters:
dtype (np.dtype or str) – Dtype to cast to
- Returns:
An arkouda pdarray with values converted to the specified data type
- Return type:
ak.pdarray
Notes
This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.
- slice_bits(low, high) pdarray#
Returns a pdarray containing only bits from low to high of self.
This is zero indexed and inclusive on both ends, so slicing the bottom 64 bits is pda.slice_bits(0, 63)
- Parameters:
low (int) – The lowest bit included in the slice (inclusive) zero indexed, so the first bit is 0
high (int) – The highest bit included in the slice (inclusive)
- Returns:
A new pdarray containing the bits of self from low to high
- Return type:
pdarray
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> p = ak.array([2**65 + (2**64 - 1)])
>>> bin(p[0])
'0b101111111111111111111111111111111111111111111111111111111111111111'
>>> bin(p.slice_bits(64, 65)[0])
'0b10'
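On a plain Python integer, the same inclusive, zero-indexed bit slice can be written as a shift and mask; this hypothetical helper mirrors the semantics described above:

```python
def slice_bits(value: int, low: int, high: int) -> int:
    """Keep bits low..high of value, inclusive on both ends, zero-indexed."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)

p = 2**65 + (2**64 - 1)
print(bin(slice_bits(p, 64, 65)))          # 0b10 (bit 64 is 0, bit 65 is 1)
print(slice_bits(p, 0, 63) == 2**64 - 1)   # True: the bottom 64 bits
```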
- bigint_to_uint_arrays() List[pdarray]#
Create a list of uint pdarrays from a bigint pdarray.
- Returns:
A list of uint pdarrays in which the first item holds the highest 64 bits of the bigint pdarray and the last item holds the lowest 64 bits.
- Return type:
List[pdarray]
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> a = ak.arange(2**64, 2**64 + 5)
>>> a
array(["18446744073709551616" "18446744073709551617" "18446744073709551618" "18446744073709551619" "18446744073709551620"])
>>> a.bigint_to_uint_arrays()
[array([1 1 1 1 1]), array([0 1 2 3 4])]
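The decomposition is ordinary base-2**64 limb splitting. A hypothetical pure-Python sketch of the same ordering (highest limb first):

```python
def to_uint64_limbs(value: int) -> list:
    """Split a nonnegative int into 64-bit limbs, most significant first."""
    limbs = []
    while True:
        limbs.append(value & (2**64 - 1))  # take the lowest 64 bits
        value >>= 64
        if value == 0:
            break
    return limbs[::-1]  # reverse so the highest 64 bits come first

print(to_uint64_limbs(2**64 + 3))  # [1, 3], matching the example above
```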
- reshape(*shape, order='row_major')#
Gives a new shape to an array without changing its data.
- Parameters:
shape (int, tuple of ints, or pdarray) – The new shape should be compatible with the original shape.
order (str {'row_major' | 'C' | 'column_major' | 'F'}) – Read the elements of the pdarray in this index order. By default, read the elements in row_major or C-like order, where the last index changes the fastest. If ‘column_major’ or ‘F’, read the elements in column_major or Fortran-like order, where the first index changes the fastest.
- Returns:
An arrayview object with the data from the array but with the new shape
- Return type:
ArrayView
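The order argument follows NumPy's 'C'/'F' convention; a NumPy sketch of the two read orders (NumPy stands in for the pdarray here):

```python
import numpy as np

a = np.arange(6)

# 'row_major' / 'C': the last index changes fastest.
print(a.reshape(2, 3, order="C").tolist())  # [[0, 1, 2], [3, 4, 5]]

# 'column_major' / 'F': the first index changes fastest.
print(a.reshape(2, 3, order="F").tolist())  # [[0, 2, 4], [1, 3, 5]]
```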
- to_ndarray() numpy.ndarray#
Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same attributes and data as the pdarray
- Return type:
np.ndarray
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the number of bytes received does not match the expected number of bytes
Notes
The number of bytes in the array cannot exceed
client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
numpy.ndarray
- to_list() List#
Convert the array to a list, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A list with the same data as the pdarray
- Return type:
list
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the number of bytes received does not match the expected number of bytes
Notes
The number of bytes in the array cannot exceed
client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_list()
[0, 1, 2, 3, 4]
>>> type(a.to_list())
list
- to_cuda()#
Convert the array to a Numba DeviceND array, transferring array data from the arkouda server to Python via ndarray. If the array exceeds a builtin size limit, a RuntimeError is raised.
- Returns:
A Numba ndarray with the same attributes and data as the pdarray; on GPU
- Return type:
numba.DeviceNDArray
- Raises:
ImportError – Raised if CUDA is not available
ModuleNotFoundError – Raised if Numba is either not installed or not enabled
RuntimeError – Raised if there is a server-side error thrown in the course of retrieving the pdarray.
Notes
The number of bytes in the array cannot exceed
client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_cuda()
array([0, 1, 2, 3, 4])
>>> type(a.to_cuda())
numba.DeviceNDArray
- to_parquet(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None) str#
Save the pdarray to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
compression (str, optional) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’.
- ‘append’ write mode is supported, but is not efficient.
- If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (Parquet)
>>> a.to_parquet('path/prefix.parquet', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
- to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute') str#
Save the pdarray to HDF5. The object can be saved to a collection of files or a single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”. When set to single, the dataset is written to a single file. When distribute, the dataset is written to a file per locale. This is only supported by HDF5 files and has no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path.
- If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')
Saves the array to a single HDF5 file on the root node: ``cwd/path/name_prefix.hdf5``
- update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)#
Overwrite the dataset with the name provided with this pdarray. If the dataset does not exist, it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
- to_csv(prefix_path: str, dataset: str = 'array', col_delim: str = ',', overwrite: bool = False)#
Write pdarray to CSV file(s). The file will contain a single column with the pdarray data. All CSV files written by Arkouda include a header denoting the data types of the columns.
- Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
dataset (str) – Column name to save the pdarray under. Defaults to “array”.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
str response message
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations. The column delimiter is expected to be the same for column names and data. Be sure that column delimiters are not found within your data. All CSV files must delimit rows using newline (\n) at this time.
- save(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str#
DEPRECATED Save the pdarray to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 supports single files, in which case the file name will be only that provided. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
compression (str, optional) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
file_format (str {'HDF5', 'Parquet'}) – By default, saved files are written in the HDF5 file format. If ‘Parquet’, the files are written in the Parquet file format. This is case insensitive.
file_type (str ("single" | "distribute")) – Default: “distribute”. When set to single, the dataset is written to a single file. When distribute, the dataset is written to a file per locale. This is only supported by HDF5 files and has no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append
TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string
See also
save_all, load, read, to_parquet, to_hdf
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission. Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. Previously all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found. Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.save('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.save('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving with an extension (Parquet)
>>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
- register(user_defined_name: str) pdarray#
Register this pdarray with a user defined name in the arkouda server so it can be attached to later using pdarray.attach(). This is an in-place operation; registering a pdarray more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one pdarray at a time.
- Parameters:
user_defined_name (str) – user defined name array is to be registered under
- Returns:
The same pdarray which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different pdarrays with the same name.
- Return type:
pdarray
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the pdarray with the user_defined_name If the user is attempting to register more than one pdarray with the same name, the former should be unregistered first to free up the registration name.
See also
attach, unregister, is_registered, list_registry, unregister_pdarray_by_name
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
- unregister() None#
Unregister a pdarray in the arkouda server which was previously registered using register() and/or attached to using attach()
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not find the internal name/symbol to remove
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
- static attach(user_defined_name: str) pdarray#
Class method to return a pdarray attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which array was registered under
- Returns:
pdarray which is bound to the corresponding server side component which was registered with user_defined_name
- Return type:
pdarray
- Raises:
TypeError – Raised if user_defined_name is not a str
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
- arkouda.arange(*args, **kwargs) arkouda.pdarrayclass.pdarray#
arange([start,] stop[, stride,] dtype=int64)
Create a pdarray of consecutive integers within the interval [start, stop). If only one arg is given then arg is the stop parameter. If two args are given, then the first arg is start and second is stop. If three args are given, then the first arg is start, second is stop, third is stride.
The return value is cast to type dtype
- Parameters:
start (int_scalars, optional) – Starting value (inclusive)
stop (int_scalars) – Stopping value (exclusive)
stride (int_scalars, optional) – The difference between consecutive elements, the default stride is 1, if stride is specified then start must also be specified.
dtype (np.dtype, type, or str) – The target dtype to cast values to
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
Integers from start (inclusive) to stop (exclusive) by stride
- Return type:
pdarray, dtype
- Raises:
TypeError – Raised if start, stop, or stride is not an int object
ZeroDivisionError – Raised if stride == 0
Notes
Negative strides result in decreasing values. Currently, only int64 pdarrays can be created with this method. For float64 arrays, use the linspace method.
Examples
>>> ak.arange(0, 5, 1)
array([0, 1, 2, 3, 4])
>>> ak.arange(5, 0, -1)
array([5, 4, 3, 2, 1])
>>> ak.arange(0, 10, 2)
array([0, 2, 4, 6, 8])
>>> ak.arange(-5, -10, -1)
array([-5, -6, -7, -8, -9])
- arkouda.full(size: arkouda.dtypes.int_scalars | str, fill_value: arkouda.dtypes.numeric_scalars | str, dtype: numpy.dtype | type | str | arkouda.dtypes.BigInt = float64, max_bits: int | None = None) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings#
Create a pdarray filled with fill_value.
- Parameters:
size (int_scalars) – Size of the array (only rank-1 arrays supported)
fill_value (numeric_scalars or str) – Value with which the array will be filled
dtype (all_scalars) – Resulting array type, default float64
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
array of the requested size and dtype filled with fill_value
- Return type:
pdarray or Strings
- Raises:
TypeError – Raised if the supplied dtype is not supported or if the size parameter is neither an int nor a str that is parseable to an int.
Examples
>>> ak.full(5, 7, dtype=ak.int64)
array([7, 7, 7, 7, 7])
>>> ak.full(5, 9, dtype=ak.float64)
array([9, 9, 9, 9, 9])
>>> ak.full(5, 5, dtype=ak.bool)
array([True, True, True, True, True])
- arkouda.ones(size: arkouda.dtypes.int_scalars | str, dtype: numpy.dtype | type | str | arkouda.dtypes.BigInt = float64, max_bits: int | None = None) arkouda.pdarrayclass.pdarray#
Create a pdarray filled with ones.
- Parameters:
size (int_scalars) – Size of the array (only rank-1 arrays supported)
dtype (Union[float64, int64, bool]) – Resulting array type, default float64
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
Ones of the requested size and dtype
- Return type:
pdarray
- Raises:
TypeError – Raised if the supplied dtype is not supported or if the size parameter is neither an int nor a str that is parseable to an int.
Examples
>>> ak.ones(5, dtype=ak.int64)
array([1, 1, 1, 1, 1])
>>> ak.ones(5, dtype=ak.float64)
array([1, 1, 1, 1, 1])
>>> ak.ones(5, dtype=ak.bool)
array([True, True, True, True, True])
- arkouda.zeros(size: arkouda.dtypes.int_scalars | str, dtype: numpy.dtype | type | str | arkouda.dtypes.BigInt = float64, max_bits: int | None = None) arkouda.pdarrayclass.pdarray#
Create a pdarray filled with zeros.
- Parameters:
size (int_scalars) – Size of the array (only rank-1 arrays supported)
dtype (all_scalars) – Type of resulting array, default float64
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
Zeros of the requested size and dtype
- Return type:
pdarray
- Raises:
TypeError – Raised if the supplied dtype is not supported or if the size parameter is neither an int nor a str that is parseable to an int.
See also
Examples
>>> ak.zeros(5, dtype=ak.int64)
array([0, 0, 0, 0, 0])
>>> ak.zeros(5, dtype=ak.float64)
array([0, 0, 0, 0, 0])
>>> ak.zeros(5, dtype=ak.bool)
array([False, False, False, False, False])
- arkouda.concatenate(arrays: Sequence[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical], ordered: bool = True) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical#
Concatenate a list or tuple of pdarray or Strings objects into one pdarray or Strings object, respectively.
- Parameters:
arrays (Sequence[Union[pdarray,Strings,Categorical]]) – The arrays to concatenate. Must all have same dtype.
ordered (bool) – If True (default), the arrays will be appended in the order given. If False, array data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.
- Returns:
Single pdarray or Strings object containing all values, returned in the original order
- Return type:
Union[pdarray,Strings,Categorical]
- Raises:
ValueError – Raised if arrays is empty or if 1..n pdarrays have differing dtypes
TypeError – Raised if arrays is not a Python Sequence (such as a list or tuple) of pdarrays or Strings
RuntimeError – Raised if 1..n array elements are dtypes for which concatenate has not been implemented.
Examples
>>> ak.concatenate([ak.array([1, 2, 3]), ak.array([4, 5, 6])])
array([1, 2, 3, 4, 5, 6])
>>> ak.concatenate([ak.array([True,False,True]),ak.array([False,True,True])])
array([True, False, True, False, True, True])
>>> ak.concatenate([ak.array(['one','two']),ak.array(['three','four','five'])])
array(['one', 'two', 'three', 'four', 'five'])
- arkouda.in1d(pda1: arkouda.groupbyclass.groupable, pda2: arkouda.groupbyclass.groupable, assume_unique: bool = False, symmetric: bool = False, invert: bool = False) arkouda.pdarrayclass.pdarray | arkouda.groupbyclass.groupable#
Test whether each element of a 1-D array is also present in a second array.
Returns a boolean array the same length as pda1 that is True where an element of pda1 is in pda2 and False otherwise.
Supports multi-level arrays: tests membership of rows of a in the set of rows of b.
- Parameters:
a (list of pdarrays, pdarray, Strings, or Categorical) – Rows are elements for which to test membership in b
b (list of pdarrays, pdarray, Strings, or Categorical) – Rows are elements of the set in which to test membership
assume_unique (bool) – If true, assume rows of a and b are each unique and sorted. By default, sort and unique them explicitly.
symmetric (bool) – If True, return both in1d(pda1, pda2) and in1d(pda2, pda1) when pda1 and pda2 are single items.
invert (bool, optional) – If True, the values in the returned array are inverted (that is, False where an element of pda1 is in pda2 and True otherwise). Default is False.
ak.in1d(a, b, invert=True)is equivalent to (but is faster than)~ak.in1d(a, b).
- Returns:
True for each row in a that is contained in b
- Return type:
pdarray, bool
Notes
Only works for pdarrays of int64 dtype, Strings, or Categorical
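The membership test follows the semantics of NumPy's isin/in1d; a NumPy sketch of the flat (single-level) case, including invert:

```python
import numpy as np

a = np.array([0, 1, 2, 5, 0])
b = np.array([0, 2])

# True where an element of a appears anywhere in b.
print(np.isin(a, b))               # [ True False  True False  True]

# invert=True flips the mask in one pass (faster than ~result).
print(np.isin(a, b, invert=True))  # [False  True False  True False]
```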
- class arkouda.Strings(strings_pdarray: arkouda.pdarrayclass.pdarray, bytes_size: arkouda.dtypes.int_scalars)#
Represents an array of strings whose data resides on the arkouda server. The user should not call this class directly; rather its instances are created by other arkouda functions.
- entry#
Encapsulation of a Segmented Strings array contained on the arkouda server. This is a composite of
offsets array: starting indices for each string
bytes array: raw bytes of all strings joined by nulls
- Type:
- size#
The number of strings in the array
- Type:
int_scalars
- nbytes#
The total number of bytes in all strings
- Type:
int_scalars
- ndim#
The rank of the array (currently only rank 1 arrays supported)
- Type:
int_scalars
- shape#
The sizes of each dimension of the array
- Type:
tuple
- dtype#
The dtype is ak.str
- Type:
dtype
- logger#
Used for all logging operations
- Type:
ArkoudaLogger
Notes
Strings is composed of two pdarrays: (1) offsets, which contains the starting indices for each string and (2) bytes, which contains the raw bytes of all strings, delimited by nulls.
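The two-pdarray layout in the note above can be sketched in plain Python; offsets and raw here are hypothetical client-side stand-ins for the server-side offsets and bytes arrays:

```python
strings = ["one", "two", "three"]

raw = bytearray()
offsets = []
for s in strings:
    offsets.append(len(raw))            # starting index of this string
    raw += s.encode("utf-8") + b"\x00"  # raw bytes, strings joined by nulls

print(offsets)        # [0, 4, 8] -- the values get_offsets() reports
print(list(raw)[:8])  # [111, 110, 101, 0, 116, 119, 111, 0]

def get(i: int) -> str:
    """Recover string i by scanning from its offset to the next null."""
    start = offsets[i]
    end = raw.index(0, start)
    return raw[start:end].decode("utf-8")

print(get(2))  # three
```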
- BinOps#
- objType = 'Strings'#
- static from_return_msg(rep_msg: str) Strings#
Factory method for creating a Strings object from an Arkouda server response message
- Parameters:
rep_msg (str) – Server response message currently of form created name type size ndim shape itemsize+created bytes.size 1234
- Returns:
object representing a segmented strings array on the server
- Return type:
- Raises:
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
We really don’t have an itemsize because these are variable length strings. In the future we could probably use this position to store the total bytes.
- static from_parts(offset_attrib: arkouda.pdarrayclass.pdarray | str, bytes_attrib: arkouda.pdarrayclass.pdarray | str) Strings#
Factory method for creating a Strings object from an Arkouda server response where the arrays are separate components.
- Parameters:
- Returns:
object representing a segmented strings array on the server
- Return type:
- Raises:
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
This factory method is used when we construct the parts of a Strings object on the client side and transfer the offsets & bytes separately to the server. This results in two entries in the symbol table, and we need to instruct the server to assemble them into a composite entity.
- get_lengths() arkouda.pdarrayclass.pdarray#
Return the length of each string in the array.
- Returns:
The length of each string
- Return type:
pdarray, int
- Raises:
RuntimeError – Raised if there is a server-side error thrown
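Given the segmented layout, per-string byte lengths follow from successive offset differences minus the null terminator. A plain-Python sketch (hypothetical helper, not the server implementation):

```python
def lengths_from_offsets(offsets, total_nbytes):
    """Byte length of each string in a null-delimited segmented array."""
    ends = offsets[1:] + [total_nbytes]
    return [end - start - 1 for start, end in zip(offsets, ends)]

# For ['one', 'two', 'three']: offsets [0, 4, 8], 14 bytes total
assert lengths_from_offsets([0, 4, 8], 14) == [3, 3, 5]
```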
- get_bytes()#
Getter for the bytes component (uint8 pdarray) of this Strings.
- Returns:
Pdarray of bytes of the string accessed
- Return type:
pdarray, uint8
Example
>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_bytes()
[111 110 101 0 116 119 111 0 116 104 114 101 101 0]
- get_offsets()#
Getter for the offsets component (int64 pdarray) of this Strings.
- Returns:
Pdarray of offsets of the string accessed
- Return type:
pdarray, int64
Example
>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_offsets()
[0 4 8]
- encode(toEncoding: str, fromEncoding: str = 'UTF-8')#
Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding
- Parameters:
toEncoding (str) – The encoding that the strings will be converted to
fromEncoding (str) – The current encoding of the strings object, default to UTF-8
- Returns:
A new Strings object in toEncoding
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- decode(fromEncoding, toEncoding='UTF-8')#
Return a new strings object in fromEncoding, expecting that the current Strings is encoded in toEncoding
- Parameters:
fromEncoding (str) – The current encoding of the strings object
toEncoding (str) – The encoding that the strings will be converted to, default to UTF-8
- Returns:
A new Strings object in toEncoding
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
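The semantics of encode/decode parallel Python's own codecs: each element's bytes are interpreted in the source encoding and re-encoded into the target. A client-side analogue (hypothetical helper, no server involved):

```python
def reencode(items, to_encoding, from_encoding="utf-8"):
    """Re-encode each element's raw bytes from one encoding to another."""
    return [b.decode(from_encoding).encode(to_encoding) for b in items]

# 'café' in UTF-8 is b'caf\xc3\xa9'; in Latin-1 it is b'caf\xe9'
out = reencode([b"caf\xc3\xa9"], to_encoding="latin-1")
assert out == [b"caf\xe9"]
```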
- to_lower() Strings#
Returns a new Strings with all uppercase characters from the original replaced with their lowercase equivalent
- Returns:
Strings with all uppercase characters from the original replaced with their lowercase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.to_lower()
array(['strings 0', 'strings 1', 'strings 2', 'strings 3', 'strings 4'])
- to_upper() Strings#
Returns a new Strings with all lowercase characters from the original replaced with their uppercase equivalent
- Returns:
Strings with all lowercase characters from the original replaced with their uppercase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.to_upper()
array(['STRINGS 0', 'STRINGS 1', 'STRINGS 2', 'STRINGS 3', 'STRINGS 4'])
- to_title() Strings#
Returns a new Strings with all strings from the original replaced with their titlecase equivalents
- Returns:
Strings with all strings from the original replaced with their titlecase equivalents
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Strings.to_lower, Strings.to_upper
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.to_title()
array(['Strings 0', 'Strings 1', 'Strings 2', 'Strings 3', 'Strings 4'])
- is_lower() arkouda.pdarrayclass.pdarray#
Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely lowercase
- Returns:
True for elements that are entirely lowercase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.is_lower()
array([True True True False False False])
- is_upper() arkouda.pdarrayclass.pdarray#
Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely uppercase
- Returns:
True for elements that are entirely uppercase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.is_upper()
array([False False False True True True])
- is_title() arkouda.pdarrayclass.pdarray#
Returns a boolean pdarray where index i indicates whether string i of the Strings is titlecase
- Returns:
True for elements that are titlecase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> mixed = ak.array([f'sTrINgs {i}' for i in range(3)])
>>> title = ak.array([f'Strings {i}' for i in range(3)])
>>> strings = ak.concatenate([mixed, title])
>>> strings
array(['sTrINgs 0', 'sTrINgs 1', 'sTrINgs 2', 'Strings 0', 'Strings 1', 'Strings 2'])
>>> strings.is_title()
array([False False False True True True])
- strip(chars: bytes | arkouda.dtypes.str_scalars | None = '') Strings#
Returns a new Strings object with all leading and trailing occurrences of characters contained in chars removed. The chars argument is a string specifying the set of characters to be removed. If omitted, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.
- Parameters:
chars – the set of characters to be removed
- Returns:
Strings object with the leading and trailing characters matching the set of characters in the chars argument removed
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['Strings ', ' StringS ', 'StringS '])
>>> s = strings.strip()
>>> s
array(['Strings', 'StringS', 'StringS'])
>>> strings = ak.array(['Strings 1', '1 StringS ', ' 1StringS 12 '])
>>> s = strings.strip(' 12')
>>> s
array(['Strings', 'StringS', 'StringS'])
- cached_regex_patterns() List#
Returns the regex patterns for which Match objects have been cached
- purge_cached_regex_patterns() None#
Purges cached regex patterns
- find_locations(pattern: bytes | arkouda.dtypes.str_scalars) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Finds pattern matches and returns pdarrays containing the number, start positions, and lengths of matches
- Parameters:
pattern (str_scalars) – The regex pattern used to find matches
- Returns:
pdarray, int64 – For each original string, the number of pattern matches
pdarray, int64 – The start positions of pattern matches
pdarray, int64 – The lengths of pattern matches
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> num_matches, starts, lens = strings.find_locations('\d')
>>> num_matches
array([2, 2, 2, 2, 2])
>>> starts
array([0, 9, 0, 9, 0, 9, 0, 9, 0, 9])
>>> lens
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
- search(pattern: bytes | arkouda.dtypes.str_scalars) arkouda.match.Match#
Returns a match object with the first location in each element where pattern produces a match. Elements match if any part of the string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match if any part of the string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+')
<ak.Match object: matched=True, span=(1, 2); matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
- match(pattern: bytes | arkouda.dtypes.str_scalars) arkouda.match.Match#
Returns a match object where elements match only if the beginning of the string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match only if the beginning of the string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.match('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
- fullmatch(pattern: bytes | arkouda.dtypes.str_scalars) arkouda.match.Match#
Returns a match object where elements match only if the whole string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match only if the whole string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.fullmatch('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=False; matched=False>
- split(pattern: bytes | arkouda.dtypes.str_scalars, maxsplit: int = 0, return_segments: bool = False) Strings | Tuple#
Returns a new Strings split by the occurrences of pattern. If maxsplit is nonzero, at most maxsplit splits occur
- Parameters:
pattern (str) – Regex used to split strings into substrings
maxsplit (int) – The max number of pattern match occurrences in each element to split. The default maxsplit=0 splits on all occurrences
return_segments (bool) – If True, return mapping of original strings to first substring in return array.
- Returns:
Strings – Substrings with pattern matches removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.split('_+', maxsplit=2, return_segments=True)
(array(['1', '2', '', '', '', '3', '', '4', '5____6___7', '']), array([0 3 5 6 9]))
- findall(pattern: bytes | arkouda.dtypes.str_scalars, return_match_origins: bool = False) Strings | Tuple#
Return a new Strings containing all non-overlapping matches of pattern
- Parameters:
pattern (str_scalars) – Regex used to find matches
return_match_origins (bool) – If True, return a pdarray containing the index of the original string each pattern match is from
- Returns:
Strings – Strings object containing only pattern matches
pdarray, int64 (optional) – The index of the original string each pattern match is from
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.findall('_+', return_match_origins=True)
(array(['_', '___', '____', '__', '___', '____', '___']), array([0 0 1 3 3 3 3]))
- sub(pattern: bytes | arkouda.dtypes.str_scalars, repl: bytes | arkouda.dtypes.str_scalars, count: int = 0) Strings#
Return new Strings obtained by replacing non-overlapping occurrences of pattern with the replacement repl. If count is nonzero, at most count substitutions occur
- Parameters:
pattern (str_scalars) – The regex to substitute
repl (str_scalars) – The substring to replace pattern matches with
count (int) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Returns:
Strings with pattern matches replaced
- Return type:
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.sub(pattern='_+', repl='-', count=2)
array(['1-2-', '-', '3', '-4-5____6___7', ''])
- subn(pattern: bytes | arkouda.dtypes.str_scalars, repl: bytes | arkouda.dtypes.str_scalars, count: int = 0) Tuple#
Perform the same operation as sub(), but return a tuple (new_Strings, number_of_substitutions)
- Parameters:
pattern (str_scalars) – The regex to substitute
repl (str_scalars) – The substring to replace pattern matches with
count (int) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Returns:
Strings – Strings with pattern matches replaced
pdarray, int64 – The number of substitutions made for each element of Strings
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.subn(pattern='_+', repl='-', count=2)
(array(['1-2-', '-', '3', '-4-5____6___7', '']), array([2 1 0 2 0]))
- contains(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element contains the given substring.
- Parameters:
substr (str_scalars) – The substring in the form of string or byte array to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that contain substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> strings
array(['1 string 1', '2 string 2', '3 string 3', '4 string 4', '5 string 5'])
>>> strings.contains('string')
array([True, True, True, True, True])
>>> strings.contains('string \d', regex=True)
array([True, True, True, True, True])
- startswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element starts with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The prefix to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that start with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.startswith('string')
array([True, True, True, True, True])
>>> strings_start = ak.array([f'{i} string' for i in range(1, 6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.startswith('\d str', regex=True)
array([True, True, True, True, True])
- endswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element ends with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The suffix to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that end with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings_start = ak.array([f'{i} string' for i in range(1, 6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.endswith('ing')
array([True, True, True, True, True])
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.endswith('ing \d', regex=True)
array([True, True, True, True, True])
- flatten(delimiter: str, return_segments: bool = False, regex: bool = False) Strings | Tuple#
Unpack delimiter-joined substrings into a flat array.
- Parameters:
delimiter (str) – Characters used to split strings into substrings
return_segments (bool) – If True, also return mapping of original strings to first substring in return array.
regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
Strings – Flattened substrings with delimiters removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array
Examples
>>> orig = ak.array(['one|two', 'three|four|five', 'six'])
>>> orig.flatten('|')
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> flat, map = orig.flatten('|', return_segments=True)
>>> map
array([0, 2, 5])
>>> under = ak.array(['one_two', 'three_____four____five', 'six'])
>>> under_flat, under_map = under.flatten('_+', return_segments=True, regex=True)
>>> under_flat
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> under_map
array([0, 2, 5])
- peel(delimiter: bytes | arkouda.dtypes.str_scalars, times: arkouda.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, fromRight: bool = False, regex: bool = False) Tuple#
Peel off one or more delimited fields from each string (similar to string.partition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the first (times-1) delimiters
includeDelimiter (bool) – If true, append the delimiter to the end of the first return array. By default, it is prepended to the beginning of the second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the first array. By default, such strings are returned in the second array.
fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)
regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
- left: Strings
The field(s) peeled from the end of each string (unless fromRight is true)
- right: Strings
The remainder of each string after peeling (unless fromRight is true)
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars, if times is not int64, or if includeDelimiter, keepPartial, or fromRight is not bool
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
>>> s.peel('.', includeDelimiter=True)
(array(['a.', 'c.', 'e.']), array(['b', 'd', 'f.g']))
>>> s.peel('.', times=2)
(array(['', '', 'e.f']), array(['a.b', 'c.d', 'g']))
>>> s.peel('.', times=2, keepPartial=True)
(array(['a.b', 'c.d', 'e.f']), array(['', '', 'g']))
- rpeel(delimiter: bytes | arkouda.dtypes.str_scalars, times: arkouda.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, regex: bool = False)#
Peel off one or more delimited fields from the end of each string (similar to string.rpartition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the last (times-1) delimiters
includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return array. By default, it is appended to the end of the second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the second array. By default, such strings are returned in the first array.
regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
- left: Strings
The remainder of the string after peeling
- right: Strings
The field(s) that were peeled from the right of each string
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if times is not int64
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.rpeel('.')
(array(['a', 'c', 'e.f']), array(['b', 'd', 'g']))
>>> # Compared against peel
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
- stick(other: Strings, delimiter: bytes | arkouda.dtypes.str_scalars = '', toLeft: bool = False) Strings#
Join the strings from another array onto one end of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (str) – String inserted between self and other
toLeft (bool) – If true, join other strings to the left of self. By default, other is joined to the right of self.
- Returns:
The array of joined strings
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if the other parameter is not a Strings instance
ValueError – Raised if times is < 1
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.stick(t, delimiter='.')
array(['a.b', 'c.d', 'e.f'])
- lstick(other: Strings, delimiter: bytes | arkouda.dtypes.str_scalars = '') Strings#
Join the strings from another array onto the left of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (Union[bytes,str_scalars]) – String inserted between self and other
- Returns:
The array of joined strings, as other + self
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is neither bytes nor a str or if the other parameter is not a Strings instance
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.lstick(t, delimiter='.')
array(['b.a', 'd.c', 'f.e'])
- get_prefixes(n: arkouda.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray]#
Return the n-long prefix of each string, where possible
- Parameters:
n (int) – Length of prefix
return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-prefix
proper (bool) – If True, only return proper prefixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a prefix.
- Returns:
prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character prefix, False otherwise.
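The proper-prefix rule can be sketched in plain Python: when proper=True, an n-prefix is returned only for strings longer than n, together with a boolean mask of which originals qualified. `prefixes_with_mask` is a hypothetical stand-in, not the arkouda implementation:

```python
def prefixes_with_mask(strings, n, proper=True):
    """Return (n-character prefixes, mask of strings long enough to have one)."""
    mask = [len(s) > n if proper else len(s) >= n for s in strings]
    return [s[:n] for s, keep in zip(strings, mask) if keep], mask

prefixes, mask = prefixes_with_mask(["apple", "fig", "banana"], 3)
assert prefixes == ["app", "ban"]          # 'fig' is exactly 3 long, so skipped
assert mask == [True, False, True]
```

The analogous rule, applied to the tail of each string, governs get_suffixes below.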
- get_suffixes(n: arkouda.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray]#
Return the n-long suffix of each string, where possible
- Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-suffix
proper (bool) – If True, only return proper suffixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a suffix.
- Returns:
suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character suffix, False otherwise.
- hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Compute a 128-bit hash of each string.
- Returns:
A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.
- Return type:
Notes
The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.
- group() arkouda.pdarrayclass.pdarray#
Return the permutation that groups the array, placing equivalent strings together. All instances of the same string are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.
- Returns:
The permutation that groups the array by value
- Return type:
Notes
If the arkouda server is compiled with “-sSegmentedString.useHash=true”, then arkouda uses 128-bit hash values to group strings, rather than sorting the strings directly. This method is fast, but the resulting permutation merely groups equivalent strings and does not sort them. If the “useHash” parameter is false, then a full sort is performed.
- Raises:
RuntimeError – Raised if there is a server-side error in executing group request or creating the pdarray encapsulating the return message
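The effect of a grouping permutation can be illustrated with numpy: argsort yields a permutation under which equal values land in contiguous blocks. (A sort additionally orders the blocks, which group() does not guarantee.)

```python
import numpy as np

values = np.array(["b", "a", "b", "a"])
perm = np.argsort(values, kind="stable")  # one valid grouping permutation
grouped = values[perm]                    # equal strings now contiguous
assert list(grouped) == ["a", "a", "b", "b"]
```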
- to_ndarray() numpy.ndarray#
Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same strings as this array
- Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
numpy.ndarray
- to_list() list#
Convert the SegString to a list, transferring data from the arkouda server to Python. If the SegString exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A list with the same strings as this SegString
- Return type:
list
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_list()
['hello', 'my', 'world']
>>> type(a.to_list())
list
- astype(dtype) arkouda.pdarrayclass.pdarray#
Cast values of Strings object to provided dtype
- Parameters:
dtype (np.dtype or str) – Dtype to cast to
- Returns:
An arkouda pdarray with values converted to the specified data type
- Return type:
ak.pdarray
Notes
This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.
- to_parquet(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', compression: str | None = None) str#
Save the Strings object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If 'append', attempt to create new dataset in existing files.
compression (str (Optional)) – (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4") Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'.
'append' write mode is supported, but is not efficient.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- to_hdf(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, file_type: str = 'distribute') str#
Save the Strings object to HDF5. The object can be saved to a collection of files or single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – The name of the Strings dataset to be written, defaults to strings_array
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.
save_offsets (bool) – Defaults to True, which will instruct the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read.
file_type (str ("single" | "distribute")) – Default: 'distribute'. 'distribute' saves the dataset over a file per locale; 'single' saves the dataset to one file.
- Return type:
String message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
Parquet files do not store the segments, only the values.
Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string
the hdf5 group is named via the dataset parameter.
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form
<prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
See also
- update_hdf(prefix_path: str, dataset: str = 'strings_array', save_offsets: bool = True, repack: bool = True)#
Overwrite the dataset with the name provided with this Strings object. If the dataset does not exist, it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read.
repack (bool) – Default: True. HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to False will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the Strings object
Notes
If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
- to_csv(prefix_path: str, dataset: str = 'strings_array', col_delim: str = ',', overwrite: bool = False)#
Write Strings to CSV file(s). File will contain a single column with the Strings data. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).
- Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
dataset (str) – Column name to save the Strings under. Defaults to “strings_array”.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
str response message
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
- save(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str#
DEPRECATED Save the Strings object to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 supports single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: The name of the Strings dataset to be written, defaults to strings_array :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, create a new Strings dataset within existing files.
- Parameters:
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read. This is not supported for Parquet files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
file_format (str) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.
file_type (str ("single" | "distribute")) – Default: Distribute Distribute the dataset over a file per locale. Single file will save the dataset to one file
- Return type:
String message indicating result of save operation
Notes
Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter. (3) Parquet files do not store the segments, only the values.
- info() str#
Returns a JSON formatted string containing information about all components of self
- Parameters:
None –
- Returns:
JSON string containing information about all components of self
- Return type:
str
- pretty_print_info() None#
Prints information about all components of self in a human readable format
- Parameters:
None –
- Return type:
None
- register(user_defined_name: str) Strings#
Register this Strings object with a user defined name in the arkouda server so it can be attached to later using Strings.attach(). This is an in-place operation; registering a Strings object more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one object at a time.
- Parameters:
user_defined_name (str) – user defined name which the Strings object is to be registered under
- Returns:
The same Strings object which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different objects with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Strings object with the user_defined_name If the user is attempting to register more than one object with the same name, the former should be unregistered first to free up the registration name.
See also
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- unregister() None#
Unregister a Strings object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not find the internal name/symbol to remove
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry
- Parameters:
None –
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
- static attach(user_defined_name: str) Strings#
class method to return a Strings object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which the Strings object was registered under
- Returns:
the Strings object registered with user_defined_name in the arkouda server
- Return type:
Strings object
- Raises:
TypeError – Raised if user_defined_name is not a str
See also
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- static unregister_strings_by_name(user_defined_name: str) None#
Unregister a Strings object in the arkouda server previously registered via register()
- Parameters:
user_defined_name (str) – The registered name of the Strings object
See also
- transfer(hostname: str, port: arkouda.dtypes.int_scalars)#
Sends a Strings object to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the Strings object is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports in succession, so ports in the range {port..(port+numLocales)} will be used (e.g., when running an Arkouda server on 4 nodes and port 1234 is passed, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- arkouda.unsqueeze(p)#
- arkouda.zero_up(vals)#
Map an array of sparse values to 0-up indices.
- arkouda.align(*args)#
Map multiple arrays of sparse identifiers to a common 0-up index.
- Parameters:
*args (pdarrays) – Arrays to map to dense index
- Returns:
aligned – Arrays with values replaced by 0-up indices
- Return type:
list of pdarrays
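As an illustration of this mapping, here is a hypothetical pure-Python sketch of the semantics on small lists. The real ak.align operates on pdarrays server-side and in parallel; `align_sketch` and its sorted-universe ordering are assumptions for illustration only.

```python
# Hypothetical sketch: map sparse identifiers from several lists onto a
# common 0-up (dense) index. Illustration only; ak.align does this in
# parallel on pdarrays, and its exact ordering may differ.
def align_sketch(*args):
    # The sorted union of all values defines the dense index.
    universe = sorted(set().union(*args))
    to_dense = {v: i for i, v in enumerate(universe)}
    return [[to_dense[v] for v in a] for a in args]

left = [10, 30, 10]
right = [30, 50]
aligned = align_sketch(left, right)
# The universe is [10, 30, 50], so left maps to [0, 1, 0] and right to [1, 2].
```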
- arkouda.right_align(left, right)#
Map two arrays of sparse values to the 0-up index set implied by the right array, discarding values from left that do not appear in right.
- arkouda.left_align(left, right)#
Map two arrays of sparse identifiers to the 0-up index set implied by the left array, discarding values from right that do not appear in left.
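A hypothetical pure-Python sketch of the left_align semantics on small lists: the dense index is implied by the values of `left`, and values in `right` absent from `left` are marked with -1 here. The name `left_align_sketch` and the -1 marking are assumptions for illustration; the real ak.left_align runs server-side and its handling of missing values may differ.

```python
# Hypothetical sketch of left_align on Python lists: the dense index is
# implied by the values of `left`; right-hand values not present in
# `left` are marked with -1 in this illustration.
def left_align_sketch(left, right):
    universe = sorted(set(left))
    to_dense = {v: i for i, v in enumerate(universe)}
    return ([to_dense[v] for v in left],
            [to_dense.get(v, -1) for v in right])
```

right_align is the mirror image, with the index set implied by `right` instead.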
- exception arkouda.NonUniqueError#
Bases:
ValueError
Inappropriate argument value (of correct type).
- arkouda.find(query, space)#
Return indices of query items in a search list of items (-1 if not found).
- Parameters:
query ((sequence of) array-like) – The items to search for. If multiple arrays, each “row” is an item.
space ((sequence of) array-like) – The set of items in which to search. Must have same shape/dtype as query.
- Returns:
indices – For each item in query, its index in space or -1 if not found.
- Return type:
pdarray, int64
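A pure-Python sketch of find's contract on small lists: for each query item, the index of an occurrence in `space`, or -1 if absent. This sketch picks the first occurrence; it illustrates the semantics only, not ak.find's parallel implementation, and `find_sketch` is a hypothetical name.

```python
# Pure-Python sketch of find's contract: for each query item, return the
# index of an occurrence in `space`, or -1 if absent. This sketch picks
# the first occurrence; the real ak.find runs in parallel on pdarrays.
def find_sketch(query, space):
    first_pos = {}
    for i, item in enumerate(space):
        first_pos.setdefault(item, i)
    return [first_pos.get(q, -1) for q in query]
```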
- arkouda.lookup(keys, values, arguments, fillvalue=-1)#
Apply the function defined by the mapping keys –> values to arguments.
- Parameters:
keys ((sequence of) array-like) – The domain of the function. Entries must be unique (if a sequence of arrays is given, each row is treated as a tuple-valued entry).
values (pdarray) – The range of the function. Must be same length as keys.
arguments ((sequence of) array-like) – The arguments on which to evaluate the function. Must have same dtype (or tuple of dtypes, for a sequence) as keys.
fillvalue (scalar) – The default value to return for arguments not in keys.
- Returns:
evaluated – The result of evaluating the function over arguments.
- Return type:
Notes
While the values cannot be Strings (or other complex objects), the same result can be achieved by passing an arange as the values, then using the return as indices into the desired object.
Examples
# Lookup numbers by two-word name
>>> keys1 = ak.array(['twenty' for _ in range(5)])
>>> keys2 = ak.array(['one', 'two', 'three', 'four', 'five'])
>>> values = ak.array([21, 22, 23, 24, 25])
>>> args1 = ak.array(['twenty', 'thirty', 'twenty'])
>>> args2 = ak.array(['four', 'two', 'two'])
>>> aku.lookup([keys1, keys2], values, [args1, args2])
array([24, -1, 22])
# Other direction requires an intermediate index
>>> revkeys = values
>>> revindices = ak.arange(values.size)
>>> revargs = ak.array([24, 21, 22])
>>> idx = aku.lookup(revkeys, revindices, revargs)
>>> keys1[idx], keys2[idx]
(array(['twenty', 'twenty', 'twenty']), array(['four', 'one', 'two']))
- arkouda.in1d_intervals(vals, intervals, symmetric=False)#
Test each value for membership in any of a set of half-open (pythonic) intervals.
- Parameters:
vals (pdarray(int, float)) – Values to test for membership in intervals
intervals (2-tuple of pdarrays) – Non-overlapping, half-open intervals, as a tuple of (lower_bounds_inclusive, upper_bounds_exclusive)
symmetric (bool) – If True, also return boolean pdarray indicating which intervals contained one or more query values.
- Returns:
pdarray(bool) – Array of same length as <vals>, True if corresponding value is included in any of the ranges defined by (low[i], high[i]) inclusive.
pdarray(bool) (if symmetric=True) – Array of same length as number of intervals, True if corresponding interval contains any of the values in <vals>.
Notes
- First return array is equivalent to the following:
((vals >= intervals[0][0]) & (vals < intervals[1][0])) | ((vals >= intervals[0][1]) & (vals < intervals[1][1])) | … ((vals >= intervals[0][-1]) & (vals < intervals[1][-1]))
But much faster when testing many ranges.
- Second (optional) return array is equivalent to:
((intervals[0] <= vals[0]) & (intervals[1] > vals[0])) | ((intervals[0] <= vals[1]) & (intervals[1] > vals[1])) | … ((intervals[0] <= vals[-1]) & (intervals[1] > vals[-1]))
But much faster when vals is non-trivial size.
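The first equivalence above can be written as a pure-Python sketch on small lists (a hypothetical `in1d_intervals_sketch`, illustration only; the real ak.in1d_intervals computes this in parallel on pdarrays):

```python
# Pure-Python sketch of the first return value, directly following the
# expansion in the Notes: vals[j] is a member iff
# low[i] <= vals[j] < high[i] for some half-open interval i.
def in1d_intervals_sketch(vals, intervals):
    lows, highs = intervals
    return [any(lo <= v < hi for lo, hi in zip(lows, highs)) for v in vals]
```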
- arkouda.search_intervals(vals, intervals, tiebreak=None, hierarchical=True)#
Given an array of query vals and non-overlapping, closed intervals, return the index of the best (see tiebreak) interval containing each query value, or -1 if not present in any interval.
- Parameters:
vals ((sequence of) pdarray(int, uint, float)) – Values to search for in intervals. If multiple arrays, each “row” is an item.
intervals (2-tuple of (sequences of) pdarrays) – Non-overlapping, half-open intervals, as a tuple of (lower_bounds_inclusive, upper_bounds_exclusive) Must have same dtype(s) as vals.
tiebreak ((optional) pdarray, numeric) – When a value is present in more than one interval, the interval with the lowest tiebreak value will be chosen. If no tiebreak is given, the first containing interval will be chosen.
hierarchical (boolean) – When True, sequences of pdarrays will be treated as components specifying a single dimension (i.e. hierarchical) When False, sequences of pdarrays will be specifying multi-dimensional intervals
- Returns:
idx – Index of interval containing each query value, or -1 if not found
- Return type:
pdarray(int64)
Notes
- The return idx satisfies the following condition:
present = idx > -1 ((intervals[0][idx[present]] <= vals[present]) &
(intervals[1][idx[present]] >= vals[present])).all()
Examples
>>> starts = (ak.array([0, 5]), ak.array([0, 11])) >>> ends = (ak.array([5, 9]), ak.array([10, 20])) >>> vals = (ak.array([0, 0, 2, 5, 5, 6, 6, 9]), ak.array([0, 20, 1, 5, 15, 0, 12, 30])) >>> ak.search_intervals(vals, (starts, ends), hierarchical=False) array([0 -1 0 0 1 -1 1 -1]) >>> ak.search_intervals(vals, (starts, ends)) array([0 0 0 0 1 1 1 -1]) >>> bi_starts = ak.bigint_from_uint_arrays([ak.cast(a, ak.uint64) for a in starts]) >>> bi_ends = ak.bigint_from_uint_arrays([ak.cast(a, ak.uint64) for a in ends]) >>> bi_vals = ak.bigint_from_uint_arrays([ak.cast(a, ak.uint64) for a in vals]) >>> bi_starts, bi_ends, bi_vals (array(["0" "92233720368547758091"]), array(["92233720368547758090" "166020696663385964564"]), array(["0" "20" "36893488147419103233" "92233720368547758085" "92233720368547758095" "110680464442257309696" "110680464442257309708" "166020696663385964574"])) >>> ak.search_intervals(bi_vals, (bi_starts, bi_ends)) array([0 0 0 0 1 1 1 -1])
- arkouda.is_cosorted(arrays)#
Return True iff the arrays are cosorted, i.e., if the arrays were columns in a table then the rows are sorted.
- Parameters:
arrays (list-like of pdarrays) – Arrays to check for cosortedness
- Returns:
True iff arrays are cosorted.
- Return type:
bool
- Raises:
ValueError – Raised if arrays are not the same length
TypeError – Raised if arrays is not a list-like of pdarrays
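The property being checked can be sketched in pure Python on small lists (a hypothetical `is_cosorted_sketch`; ak.is_cosorted performs the equivalent check server-side without materializing row tuples):

```python
# Pure-Python sketch of the cosortedness check: treat the arrays as table
# columns and verify the rows are in lexicographic (row-major) order.
def is_cosorted_sketch(arrays):
    rows = list(zip(*arrays))
    return all(a <= b for a, b in zip(rows, rows[1:]))
```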
- arkouda.interval_lookup(keys, values, arguments, fillvalue=-1, tiebreak=None, hierarchical=False)#
Apply a function defined over intervals to an array of arguments.
- Parameters:
keys (2-tuple of (sequences of) pdarrays) – Tuple of closed intervals expressed as (lower_bounds_inclusive, upper_bounds_inclusive). Must have same dtype(s) as vals.
values (pdarray) – Function value to return for each entry in keys.
arguments ((sequences of) pdarray) – Values to search for in intervals. If multiple arrays, each “row” is an item.
fillvalue (scalar) – Default value to return when argument is not in any interval.
tiebreak ((optional) pdarray, numeric) – When an argument is present in more than one key interval, the interval with the lowest tiebreak value will be chosen. If no tiebreak is given, the first valid key interval will be chosen.
- Returns:
Value of function corresponding to the keys interval containing each argument, or fillvalue if argument not in any interval.
- Return type:
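A pure-Python sketch of the interval_lookup semantics with one-dimensional closed intervals (a hypothetical `interval_lookup_sketch`; tiebreak and multi-dimensional intervals are omitted, and the real function runs server-side on pdarrays):

```python
# Pure-Python sketch of interval_lookup: for each argument, return the
# value of the first closed interval [low[i], high[i]] containing it,
# else fillvalue. Illustration only.
def interval_lookup_sketch(keys, values, arguments, fillvalue=-1):
    lows, highs = keys
    out = []
    for a in arguments:
        hit = fillvalue
        for lo, hi, v in zip(lows, highs, values):
            if lo <= a <= hi:
                hit = v
                break
        out.append(hit)
    return out
```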
- class arkouda.DataFrame(initialdata=None, index=None)#
Bases:
collections.UserDict
A DataFrame structure based on arkouda arrays.
Examples
Create an empty DataFrame and add a column of data:
>>> import arkouda as ak >>> import numpy as np >>> import pandas as pd >>> df = ak.DataFrame() >>> df['a'] = ak.array([1,2,3])
Create a new DataFrame using a dictionary of data:
>>> userName = ak.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
>>> userID = ak.array([111, 222, 111, 333, 222, 111])
>>> item = ak.array([0, 0, 1, 1, 2, 0])
>>> day = ak.array([5, 5, 6, 5, 6, 6])
>>> amount = ak.array([0.5, 0.6, 1.1, 1.2, 4.3, 0.6])
>>> df = ak.DataFrame({'userName': userName, 'userID': userID,
...                    'item': item, 'day': day, 'amount': amount})
>>> df
DataFrame(['userName', 'userID', 'item', 'day', 'amount'] [6 rows : 224 B])
Indexing works slightly differently than with pandas:
>>> df[0]
{'userName': 'Alice', 'userID': 111, 'item': 0, 'day': 5, 'amount': 0.5}
>>> df['userID']
array([111, 222, 111, 333, 222, 111])
>>> df['userName']
array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
>>> df[[1,5,7]]
userName userID item day amount
1 Bob 222 0 5 0.6 2 Alice 111 1 6 1.1 3 Carol 333 1 5 1.2
Note that strides are not implemented except for stride = 1.
>>> df[1:5:1]
DataFrame(['userName', 'userID', 'item', 'day', 'amount'] [4 rows : 148 B])
>>> df[ak.array([1,2,3])]
DataFrame(['userName', 'userID', 'item', 'day', 'amount'] [3 rows : 112 B])
>>> df[['userID', 'day']]
DataFrame(['userID', 'day'] [6 rows : 96 B])
- property size#
Returns the number of bytes on the arkouda server.
- property dtypes#
- property empty#
- property shape#
- property columns#
- property index#
- property info#
Returns a summary string of this dataframe.
- COLUMN_CLASSES = ()#
- objType = 'DataFrame'#
- transfer(hostname, port)#
Sends a DataFrame to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the DataFrame is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports in succession, so ports in the range {port..(port+numLocales)} will be used (e.g., when running an Arkouda server on 4 nodes and port 1234 is passed, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- classmethod from_pandas(pd_df)#
- drop(keys: str | int | List[str | int], axis: str | int = 0, inplace: bool = False) None | DataFrame#
Drop column/s or row/s from the dataframe.
- Parameters:
keys (str, int or list) – The labels to be dropped on the given axis
axis (int or str) – The axis on which to drop from. 0/’index’ - drop rows, 1/’columns’ - drop columns
inplace (bool) – Default False. When True, perform the operation on the calling object. When False, return a new object.
- Returns:
DataFrame when inplace=False
None when inplace=True
Examples
Drop a column
>>> df.drop('col_name', axis=1)
Drop a row
>>> df.drop(1)
>>> df.drop(1, axis=0)
- drop_duplicates(subset=None, keep='first')#
Drops duplicated rows and returns the resulting DataFrame.
If a subset of the columns are provided then only one instance of each duplicated row will be returned (keep determines which row).
- Parameters:
subset (Iterable of column names to use to dedupe.) –
keep ({'first', 'last'}, default 'first') – Determines which duplicates (if any) to keep.
- Returns:
DataFrame with duplicates removed.
- Return type:
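The keep='first' / keep='last' behavior can be sketched in pure Python on a list of row tuples (a hypothetical `kept_row_indices`, illustration only; the real method deduplicates server-side, typically via a GroupBy over the deduplication columns):

```python
# Pure-Python sketch of keep='first' / keep='last' semantics, returning
# the indices of surviving rows in their original order.
def kept_row_indices(rows, keep='first'):
    kept = {}
    for i, row in enumerate(rows):
        if keep == 'last' or row not in kept:
            kept[row] = i
    return sorted(kept.values())
```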
- reset_index(size: bool = False, inplace: bool = False) None | DataFrame#
Set the index to an integer range.
Useful if this dataframe is the result of a slice operation from another dataframe, or if you have permuted the rows and no longer need to keep that ordering on the rows.
- Parameters:
size (int) – If size is passed, do not attempt to determine size based on existing column sizes. Assume caller handles consistency correctly.
inplace (bool) – Default False. When True, perform the operation on the calling object. When False, return a new object.
- Returns:
DataFrame when inplace=False
None when inplace=True
Note
Pandas adds a column ‘index’ to indicate the original index. Arkouda does not currently support this behavior.
- update_size()#
Computes the number of bytes on the arkouda server.
- rename(mapper: Callable | Dict | None = None, index: Callable | Dict | None = None, column: Callable | Dict | None = None, axis: str | int = 0, inplace: bool = False) DataFrame | None#
Rename indexes or columns according to a mapping.
- Parameters:
mapper (callable or dict-like, Optional) – Function or dictionary mapping existing values to new values. Nonexistent names will not raise an error. Uses the value of axis to determine if renaming column or index
column (callable or dict-like, Optional) – Function or dictionary mapping existing column names to new column names. Nonexistent names will not raise an error. When this is set, axis is ignored.
index (callable or dict-like, Optional) – Function or dictionary mapping existing index names to new index names. Nonexistent names will not raise an error. When this is set, axis is ignored
axis (int or str) – Default 0. Indicates which axis to perform the rename. 0/”index” - Indexes 1/”column” - Columns
inplace (bool) – Default False. When True, perform the operation on the calling object. When False, return a new object.
- Returns:
DataFrame when inplace=False
None when inplace=True
Examples
>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
Rename columns using a mapping
>>> df.rename(column={'A':'a', 'B':'c'})
a c
0 1 4
1 2 5
2 3 6
Rename indexes using a mapping >>> df.rename(index={0:99, 2:11})
A B
99 1 4 1 2 5 11 3 6
Rename using an axis style parameter
>>> df.rename(str.lower, axis='column')
a b
0 1 4 1 2 5 2 3 6
- append(other, ordered=True)#
Concatenate data from ‘other’ onto the end of this DataFrame, in place.
Explicitly, use the arkouda concatenate function to append the data from each column in other to the end of self. This operation is done in place, in the sense that the underlying pdarrays are updated from the result of the arkouda concatenate function, rather than returning a new DataFrame object containing the result.
- Parameters:
other (DataFrame) – The DataFrame object whose data will be appended to this DataFrame.
ordered (bool) – If False, allow rows to be interleaved for better performance (but data within a row remains together). By default, append all rows to the end, in input order.
- Returns:
Appending occurs in-place, but result is returned for compatibility.
- Return type:
self
- classmethod concat(items, ordered=True)#
Essentially an append, but with different formatting.
- head(n=5)#
Return the first n rows.
This function returns the first n rows of the dataframe. It is useful for quickly verifying data, for example, after sorting or appending rows.
- Parameters:
n (int) – Number of rows to select.
- Returns:
The first n rows of the DataFrame.
- Return type:
ak.DataFrame
See also
- tail(n=5)#
Return the last n rows.
This function returns the last n rows for the dataframe. It is useful for quickly testing if your object has the right type of data in it.
- Parameters:
n (int (default=5)) – Number of rows to select.
- Returns:
The last n rows of the DataFrame.
- Return type:
ak.DataFrame
See also
ak.dataframe.head
- sample(n=5)#
Return a random sample of n rows.
- Parameters:
n (int (default=5)) – Number of rows to return.
- Returns:
The sampled n rows of the DataFrame.
- Return type:
ak.DataFrame
- GroupBy(keys, use_series=False)#
Group the dataframe by a column or a list of columns.
- Parameters:
keys (string or list) – An (ordered) list of column names or a single string to group by.
use_series (If True, returns an ak.GroupBy object. Otherwise an arkouda GroupBy object) –
- Returns:
Either an ak GroupBy or an arkouda GroupBy object.
- Return type:
See also
- memory_usage(unit='GB')#
Print the size of this DataFrame.
- Parameters:
unit (str) – Unit to return. One of {‘KB’, ‘MB’, ‘GB’}.
- Returns:
The number of bytes used by this DataFrame in [unit]s.
- Return type:
int
- to_pandas(datalimit=maxTransferBytes, retain_index=False)#
Send this DataFrame to a pandas DataFrame.
- Parameters:
datalimit (int (default=arkouda.client.maxTransferBytes)) – The maximum size, in megabytes, to transfer. The requested DataFrame will be converted to a pandas DataFrame only if the estimated size of the DataFrame does not exceed this value.
retain_index (bool (default=False)) – Normally, to_pandas() creates a new range index object. If you want to keep the index column, set this to True.
- Returns:
The result of converting this DataFrame to a pandas DataFrame.
- Return type:
pandas.DataFrame
- to_hdf(path, index=False, columns=None, file_type='distribute')#
Save DataFrame to disk as hdf5, preserving column names.
- Parameters:
path (str) – File path to save data
index (bool) – If True, save the index column. By default, do not save the index.
columns (List) – List of columns to include in the file. If None, writes out all columns
file_type (str (single | distribute)) – Default: distribute Whether to save to a single file or distribute across Locales
- Return type:
None
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.
See also
- update_hdf(prefix_path: str, index=False, columns=None, repack: bool = True)#
Overwrite the dataset with the name provided with this dataframe. If the dataset does not exist it is added
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
index (bool) – If True, save the index column. By default, do not save the index.
columns (List) – List of columns to include in the file. If None, writes out all columns
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
- to_parquet(path, index=False, columns=None, compression: str | None = None, convert_categoricals: bool = False)#
Save DataFrame to disk as parquet, preserving column names.
- Parameters:
path (str) – File path to save data
index (bool) – If True, save the index column. By default, do not save the index.
columns (List) – List of columns to include in the file. If None, writes out all columns
compression (str (Optional)) – Default None Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4
convert_categoricals (bool) – Defaults to False Parquet requires all columns to be the same size and Categoricals don’t satisfy that requirement. if set, write the equivalent Strings in place of any Categorical columns.
- Return type:
None
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.
- to_csv(path: str, index: bool = False, columns: List[str] | None = None, col_delim: str = ',', overwrite: bool = False)#
Writes DataFrame to CSV file(s). File will contain a column for each column in the DataFrame. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).
- path: str
The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
- index: bool
Defaults to False. If True, the index of the DataFrame will be written to the file as a column
- columns: List[str] (Optional)
Column names to assign when writing data
- col_delim: str
Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
- overwrite: bool
Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
None
- ValueError
Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
- RuntimeError
Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
- TypeError
Raised if we receive an unknown arkouda_type returned from the server
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
- classmethod read_csv(filename: str, col_delim: str = ',')#
Read the columns of a CSV file into an Arkouda DataFrame. If the file contains the appropriately formatted header, typed data will be returned. Otherwise, all data will be returned as Strings objects.
- filename: str
Filename to read data from
- col_delim: str
Defaults to “,”. The delimiter for columns within the data.
Arkouda DataFrame containing the columns from the CSV file.
- ValueError
Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
- RuntimeError
Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
- TypeError
Raised if we receive an unknown arkouda_type returned from the server
to_csv
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).
- save(path, index=False, columns=None, file_format='HDF5', file_type='distribute', compression: str | None = None)#
DEPRECATED Save DataFrame to disk, preserving column names.
- Parameters:
path (str) – File path to save data
index (bool) – If True, save the index column. By default, do not save the index.
columns (List) – List of columns to include in the file. If None, writes out all columns
file_format (str) – 'HDF5' or 'Parquet'. Defaults to 'HDF5'
file_type (str ("single" | "distribute")) – Defaults to distribute. If single, will write a single file to locale 0.
compression (str (Optional)) – (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4") Compression type. Only used for Parquet
Notes
This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.
See also
- classmethod load(prefix_path, file_format='INFER')#
Load a DataFrame from file. The file_format parameter is needed for consistency with other load functions.
- argsort(key, ascending=True)#
Return the permutation that sorts the dataframe by key.
- Parameters:
key (str) – The key to sort on.
- Returns:
The permutation array that sorts the data on key.
- Return type:
ak.pdarray
- coargsort(keys, ascending=True)#
Return the permutation that sorts the dataframe by keys.
Sorting using Strings may not yield correct results
- Parameters:
keys (list) – The keys to sort on.
- Returns:
The permutation array that sorts the data on keys.
- Return type:
ak.pdarray
- sort_values(by=None, ascending=True)#
Sort the DataFrame by one or more columns.
If no column is specified, all columns are used.
Note: Fails on sorting ak.Strings when multiple columns being sorted
- Parameters:
by (str or list/tuple of str) – The name(s) of the column(s) to sort by.
ascending (bool) – Sort values in ascending (default) or descending order.
See also
- apply_permutation(perm)#
Apply a permutation to an entire DataFrame.
This may be useful if you want to unsort a DataFrame, or even to apply an arbitrary permutation such as the inverse of a sorting permutation.
- Parameters:
perm (ak.pdarray) – A permutation array. Should be the same size as the data arrays, and should consist of the integers [0,size-1] in some order. Very minimal testing is done to ensure this is a permutation.
See also
sort
- filter_by_range(keys, low=1, high=None)#
Find all rows where the value count of the items in a given set of columns (keys) is within the range [low, high].
To filter by a specific value, set low == high.
- Parameters:
keys (list or str) – The names of the columns to group by
low (int (default=1)) – The lowest value count.
high (int (default=None)) – The highest value count, default to unlimited.
- Returns:
An array of boolean values for qualified rows in this DataFrame.
- Return type:
See also
filter_by_count
- copy(deep=True)#
Make a copy of this object’s data.
When deep = True (default), a new object will be created with a copy of the calling object’s data. Modifications to the data of the copy will not be reflected in the original object.
When deep = False a new object will be created without copying the calling object’s data. Any changes to the data of the original object will be reflected in the shallow copy, and vice versa.
- Parameters:
deep (bool (default=True)) – When True, return a deep copy. Otherwise, return a shallow copy.
- Returns:
A deep or shallow copy according to caller specification.
- Return type:
aku.DataFrame
- groupby(keys, use_series=True)#
Group the dataframe by a column or a list of columns. Alias for GroupBy
- Parameters:
keys (a single column name or a list of column names) –
use_series (Change return type to Arkouda Groupby object.) –
- Return type:
An arkouda Groupby instance
- isin(values: arkouda.pdarrayclass.pdarray | Dict | arkouda.series.Series | DataFrame) DataFrame#
Determine whether each element in the DataFrame is contained in values.
- Parameters:
values (pdarray, dict, Series, or DataFrame) – The values to check for in DataFrame. Series can only have a single index.
- Returns:
Arkouda DataFrame of booleans showing whether each element in the DataFrame is contained in values
- Return type:
See also
ak.Series.isin
Notes
Pandas supports values being an iterable type. In arkouda, we replace this with pdarray
Pandas supports ~ operations. Currently, ak.DataFrame does not support this.
Examples
>>> df = ak.DataFrame({'col_A': ak.array([7, 3]), 'col_B': ak.array([1, 9])})
>>> df
   col_A  col_B
0      7      1
1      3      9
(2 rows x 2 columns)
When values is a pdarray, check every value in the DataFrame to determine if it exists in values.
>>> df.isin(ak.array([0, 1]))
   col_A  col_B
0  False   True
1  False  False
(2 rows x 2 columns)
When values is a dict, the values in the dict are passed to check the column indicated by the key.
>>> df.isin({'col_A': ak.array([0, 3])})
   col_A  col_B
0  False  False
1   True  False
(2 rows x 2 columns)
When values is a Series, each column is checked if values is present positionally. This means that for True to be returned, the indexes must be the same.
>>> i = ak.Index(ak.arange(2))
>>> s = ak.Series(data=[3, 9], index=i)
>>> df.isin(s)
   col_A  col_B
0  False  False
1  False   True
(2 rows x 2 columns)
When values is a DataFrame, the index and column must match. Note that 9 is not found because the column name does not match.
>>> other_df = ak.DataFrame({'col_A': ak.array([7, 3]), 'col_C': ak.array([0, 9])})
>>> df.isin(other_df)
   col_A  col_B
0   True  False
1   True  False
(2 rows x 2 columns)
- corr() DataFrame#
Return new DataFrame with pairwise correlation of columns
- Returns:
Arkouda DataFrame containing correlation matrix of all columns
- Return type:
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
See also
Notes
Generates the correlation matrix using Pearson R for all columns
Attempts to convert to numeric values where possible for inclusion in the matrix.
- inner_join_merge(right: DataFrame, on: str, left_suffix: str = '_x', right_suffix: str = '_y') DataFrame#
Utilizes the ak.join.inner_join function to return an ak DataFrame object containing only rows that are in both self and right DataFrames (based on the "on" parameter), as well as their associated values. For this function, self is considered the left DataFrame.
- Parameters:
right (DataFrame) – The Right DataFrame to be joined
on (str) – The name of the DataFrame column the join is being performed on
left_suffix (str = "_x") – A string indicating the suffix to add to columns from self for overlapping column names in both left and right. Defaults to “_x”
right_suffix (str = "_y") – A string indicating the suffix to add to columns from the other dataframe for overlapping column names in both left and right. Defaults to “_y”
- Returns:
Inner-Joined Arkouda DataFrame
- Return type:
- right_join_merge(right: DataFrame, on: str) DataFrame#
Utilizes the ak.join.inner_join_merge function to return an ak DataFrame object containing all the rows in the right DataFrame, as well as corresponding rows in self (based on the "on" parameter), and all of their associated values. For this function, self is considered the left DataFrame. Based on pandas merge functionality.
- merge(right: DataFrame, on: str, how: str, left_suffix: str = '_x', right_suffix: str = '_y') DataFrame#
Utilizes the ak.join.inner_join_merge and the ak.join.right_join_merge functions to return a merged Arkouda DataFrame object containing rows from both DataFrames as specified by the merge condition (based on the “how” and “on” parameters). For this function self is considered the left dataframe. Based on pandas merge functionality. https://github.com/pandas-dev/pandas/blob/main/pandas/core/reshape/merge.py#L137
- Parameters:
right (DataFrame) – The Right DataFrame to be joined
on (str) – The name of the DataFrame column the join is being performed on
how (str) – The merge condition. Must be “inner”, “left”, or “right”
left_suffix (str = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to “_x”. Only used when how is “inner”
right_suffix (str = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to “_y”. Only used when how is “inner”
- Returns:
Joined Arkouda DataFrame
- Return type:
- register(user_defined_name: str) DataFrame#
Register this DataFrame object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the DataFrame is to be registered under, this will be the root name for underlying components
- Returns:
The same DataFrame which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different DataFrames with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the DataFrame with the user_defined_name
See also
unregister, attach, unregister_dataframe_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
Any changes made to a DataFrame object after registering with the server may not be reflected in attached copies.
- unregister()#
Unregister this DataFrame object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
register, attach, unregister_dataframe_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() bool#
Return True if the object is contained in the registry
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mismatch of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- static attach(user_defined_name: str) DataFrame#
Function to return a DataFrame object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which DataFrame object was registered under
- Returns:
The DataFrame object created by re-attaching to the corresponding server components
- Return type:
- Raises:
RegistrationError – if user_defined_name is not registered
See also
- static unregister_dataframe_by_name(user_defined_name: str) None#
Function to unregister DataFrame object by name which was registered with the arkouda server via register()
- Parameters:
user_defined_name (str) – Name under which the DataFrame object was registered
- Raises:
TypeError – if user_defined_name is not a string
RegistrationError – if there is an issue attempting to unregister any underlying components
See also
- classmethod from_return_msg(rep_msg)#
- class arkouda.Datetime(pda, unit: str = _BASE_UNIT)#
Bases: _AbstractBaseTime
Represents a date and/or time.
Datetime is the Arkouda analog to pandas DatetimeIndex and other timeseries data types.
- Parameters:
pda (int64 pdarray, pd.DatetimeIndex, pd.Series, or np.datetime64 array) –
unit (str, default 'ns') –
For int64 pdarray, denotes the unit of the input. Ignored for pandas and numpy arrays, which carry their own unit. Not case-sensitive; prefixes of full names (like ‘sec’) are accepted.
Possible values:
’weeks’ or ‘w’
’days’ or ‘d’
’hours’ or ‘h’
’minutes’, ‘m’, or ‘t’
’seconds’ or ‘s’
’milliseconds’, ‘ms’, or ‘l’
’microseconds’, ‘us’, or ‘u’
’nanoseconds’, ‘ns’, or ‘n’
Unlike in pandas, units cannot be combined or mixed with integers.
Notes
The .values attribute is always in nanoseconds with int64 dtype.
- property nanosecond#
- property microsecond#
- property millisecond#
- property second#
- property minute#
- property hour#
- property day#
- property month#
- property year#
- property day_of_year#
- property dayofyear#
- property day_of_week#
- property dayofweek#
- property weekday#
- property week#
- property weekofyear#
- property date#
- property is_leap_year#
- supported_with_datetime#
- supported_with_r_datetime#
- supported_with_timedelta#
- supported_with_r_timedelta#
- supported_opeq#
- supported_with_pdarray#
- supported_with_r_pdarray#
- special_objType = 'Datetime'#
- isocalendar()#
- to_pandas()#
Convert array to a pandas DatetimeIndex. Note: if the array size exceeds client.maxTransferBytes, a RuntimeError is raised.
See also
to_ndarray
- sum()#
Return the sum of all elements in the array.
- register(user_defined_name)#
Register this Datetime object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the Datetime is to be registered under, this will be the root name for underlying components
- Returns:
The same Datetime which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Datetimes with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Datetimes with the user_defined_name
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister()#
Unregister this Datetime object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
numpy.bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mis-match of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- class arkouda.Timedelta(pda, unit: str = _BASE_UNIT)#
Bases: _AbstractBaseTime
Represents a duration, the difference between two dates or times.
Timedelta is the Arkouda equivalent of pandas.TimedeltaIndex.
- Parameters:
pda (int64 pdarray, pd.TimedeltaIndex, pd.Series, or np.timedelta64 array) –
unit (str, default 'ns') –
For int64 pdarray, denotes the unit of the input. Ignored for pandas and numpy arrays, which carry their own unit. Not case-sensitive; prefixes of full names (like ‘sec’) are accepted.
Possible values:
’weeks’ or ‘w’
’days’ or ‘d’
’hours’ or ‘h’
’minutes’, ‘m’, or ‘t’
’seconds’ or ‘s’
’milliseconds’, ‘ms’, or ‘l’
’microseconds’, ‘us’, or ‘u’
’nanoseconds’, ‘ns’, or ‘n’
Unlike in pandas, units cannot be combined or mixed with integers.
Notes
The .values attribute is always in nanoseconds with int64 dtype.
- property nanoseconds#
- property microseconds#
- property seconds#
- property days#
- property components#
- supported_with_datetime#
- supported_with_r_datetime#
- supported_with_timedelta#
- supported_with_r_timedelta#
- supported_opeq#
- supported_with_pdarray#
- supported_with_r_pdarray#
- special_objType = 'Timedelta'#
- total_seconds()#
- to_pandas()#
Convert array to a pandas TimedeltaIndex. Note: if the array size exceeds client.maxTransferBytes, a RuntimeError is raised.
See also
to_ndarray
- std(ddof: arkouda.dtypes.int_scalars = 0)#
Returns the standard deviation as a pd.Timedelta object
- sum()#
Return the sum of all elements in the array.
- abs()#
Absolute value of time interval.
- register(user_defined_name)#
Register this Timedelta object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the timedelta is to be registered under, this will be the root name for underlying components
- Returns:
The same Timedelta which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Timedeltas with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the timedelta with the user_defined_name
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister()#
Unregister this timedelta object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
numpy.bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mis-match of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- arkouda.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, inclusive='both', **kwargs)#
Creates a fixed-frequency Datetime range. Alias for ak.Datetime(pd.date_range(args)). Subject to the size limit imposed by client.maxTransferBytes.
- Parameters:
start (str or datetime-like, optional) – Left bound for generating dates.
end (str or datetime-like, optional) – Right bound for generating dates.
periods (int, optional) – Number of periods to generate.
freq (str or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’. See timeseries.offset_aliases for a list of frequency aliases.
tz (str or tzinfo, optional) – Time zone name for returning localized DatetimeIndex, for example ‘Asia/Hong_Kong’. By default, the resulting DatetimeIndex is timezone-naive.
normalize (bool, default False) – Normalize start/end dates to midnight before generating date range.
name (str, default None) – Name of the resulting DatetimeIndex.
closed ({None, 'left', 'right'}, optional) – Make the interval closed with respect to the given frequency to the ‘left’, ‘right’, or both sides (None, the default). Deprecated
inclusive ({"both", "neither", "left", "right"}, default "both") – Include boundaries. Whether to set each bound as closed or open.
**kwargs – For compatibility. Has no effect on the result.
- Returns:
rng
- Return type:
DatetimeIndex
Notes
Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omitted, the resulting DatetimeIndex will have periods linearly spaced elements between start and end (closed on both sides).
To learn more about the frequency strings, please see this link.
- arkouda.timedelta_range(start=None, end=None, periods=None, freq=None, name=None, closed=None, **kwargs)#
Return a fixed-frequency TimedeltaIndex, with day as the default frequency. Alias for ak.Timedelta(pd.timedelta_range(args)). Subject to the size limit imposed by client.maxTransferBytes.
- Parameters:
start (str or timedelta-like, default None) – Left bound for generating timedeltas.
end (str or timedelta-like, default None) – Right bound for generating timedeltas.
periods (int, default None) – Number of periods to generate.
freq (str or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’.
name (str, default None) – Name of the resulting TimedeltaIndex.
closed (str, default None) – Make the interval closed with respect to the given frequency to the ‘left’, ‘right’, or both sides (None).
- Returns:
rng
- Return type:
TimedeltaIndex
Notes
Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omitted, the resulting TimedeltaIndex will have periods linearly spaced elements between start and end (closed on both sides).
To learn more about the frequency strings, please see this link.
- arkouda.skew(pda: pdarray, bias: bool = True) numpy.float64#
Computes the sample skewness of an array. Skewness > 0 means there’s greater weight in the right tail of the distribution. Skewness < 0 means there’s greater weight in the left tail of the distribution. Skewness == 0 means the data is normally distributed. Based on the scipy.stats.skew function.
- Parameters:
pda (pdarray) – A pdarray of values that will be calculated to find the skew
bias (bool, optional) – If False, then the calculations are corrected for statistical bias.
- Returns:
The skew of all elements in the array
- Return type:
np.float64
Examples
>>> a = ak.array([1, 1, 1, 5, 10])
>>> ak.skew(a)
0.9442193396379163
- arkouda.arange(*args, **kwargs) arkouda.pdarrayclass.pdarray#
arange([start,] stop[, stride,] dtype=int64)
Create a pdarray of consecutive integers within the interval [start, stop). If only one arg is given then arg is the stop parameter. If two args are given, then the first arg is start and second is stop. If three args are given, then the first arg is start, second is stop, third is stride.
The return value is cast to type dtype
- Parameters:
start (int_scalars, optional) – Starting value (inclusive)
stop (int_scalars) – Stopping value (exclusive)
stride (int_scalars, optional) – The difference between consecutive elements, the default stride is 1, if stride is specified then start must also be specified.
dtype (np.dtype, type, or str) – The target dtype to cast values to
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
Integers from start (inclusive) to stop (exclusive) by stride
- Return type:
pdarray, dtype
- Raises:
TypeError – Raised if start, stop, or stride is not an int object
ZeroDivisionError – Raised if stride == 0
Notes
Negative strides result in decreasing values. Currently, only int64 pdarrays can be created with this method. For float64 arrays, use the linspace method.
Examples
>>> ak.arange(0, 5, 1)
array([0, 1, 2, 3, 4])
>>> ak.arange(5, 0, -1)
array([5, 4, 3, 2, 1])
>>> ak.arange(0, 10, 2)
array([0, 2, 4, 6, 8])
>>> ak.arange(-5, -10, -1)
array([-5, -6, -7, -8, -9])
- arkouda.histogram(pda: arkouda.pdarrayclass.pdarray, bins: arkouda.dtypes.int_scalars = 10) Tuple[numpy.ndarray, arkouda.pdarrayclass.pdarray]#
Compute a histogram of evenly spaced bins over the range of an array.
- Parameters:
pda (pdarray) – The values to histogram
bins (int_scalars) – The number of equal-size bins to use (default: 10)
- Returns:
Bin edges and The number of values present in each bin
- Return type:
(np.ndarray, Union[pdarray, int64 or float64])
- Raises:
TypeError – Raised if the parameter is not a pdarray or if bins is not an int.
ValueError – Raised if bins < 1
NotImplementedError – Raised if pdarray dtype is bool or uint8
See also
Notes
The bins are evenly spaced in the interval [pda.min(), pda.max()].
Examples
>>> import matplotlib.pyplot as plt
>>> A = ak.arange(0, 10, 1)
>>> nbins = 3
>>> b, h = ak.histogram(A, bins=nbins)
>>> h
array([3, 3, 4])
>>> b
array([0., 3., 6.])
# To plot, use only the left edges (now returned), and export the histogram to NumPy
>>> plt.plot(b, h.to_ndarray())
- arkouda.isnan(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray#
Test a pdarray for Not a Number (NaN) values. Currently only supports float-based arrays.
- Parameters:
pda (pdarray to test) –
- Return type:
pdarray consisting of True / False values; True where NaN, False otherwise
- Raises:
TypeError – Raised if the parameter is not a pdarray
RuntimeError – if the underlying pdarray is not float-based
- class arkouda.GroupBy(keys: groupable | None = None, assume_sorted: bool = False, **kwargs)#
Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.
- Parameters:
keys ((list of) pdarray, Strings, or Categorical) – The array to group by value, or if list, the column arrays to group by row
assume_sorted (bool) – If True, assume keys is already sorted (Default: False)
- nkeys#
The number of key arrays (columns)
- Type:
int
- size#
The length of the input array(s), i.e. number of rows
- Type:
int
- unique_keys#
The unique values of the keys array(s), in grouped order
- Type:
(list of) pdarray, Strings, or Categorical
- ngroups#
The length of the unique_keys array(s), i.e. number of groups
- Type:
int
- logger#
Used for all logging operations
- Type:
ArkoudaLogger
- Raises:
TypeError – Raised if keys is a pdarray with a dtype other than int64
Notes
Integral pdarrays, Strings, and Categoricals are natively supported, but float64 and bool arrays are not.
For a user-defined class to be groupable, it must inherit from pdarray and define or overload the grouping API:
a ._get_grouping_keys() method that returns a list of pdarrays that can be (co)argsorted.
(Optional) a .group() method that returns the permutation that groups the array
If the input is a single array with a .group() method defined, method 2 will be used; otherwise, method 1 will be used.
- Reductions#
- objType = 'GroupBy'#
- static from_return_msg(rep_msg)#
- to_hdf(prefix_path, dataset='groupby', mode='truncate', file_type='distribute')#
Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: "distribute". When set to single, the dataset is written to a single file. When distribute, the dataset is written to one file per locale. This is only supported by HDF5 files and has no impact on Parquet files.
- Returns:
None
GroupBy is not currently supported by Parquet
- update_hdf(prefix_path: str, dataset: str = 'groupby', repack: bool = True)#
- size() Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Count the number of elements in each group, i.e. the number of times each key appears.
- Parameters:
none –
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
counts (pdarray, int64) – The number of times each unique key appears
See also
Notes
This alias for "count" was added to conform with the pandas API.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
- count() Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Count the number of elements in each group, i.e. the number of times each key appears.
- Parameters:
none –
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
counts (pdarray, int64) – The number of times each unique key appears
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
- aggregate(values: groupable, operator: str, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, groupable]#
Using the permutation stored in the GroupBy instance, group another array of values and apply a reduction to each group’s values.
- Parameters:
values (pdarray) – The values to group and reduce
operator (str) – The name of the reduction operator to use
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
unique_keys (groupable) – The unique keys, in grouped order
aggregates (groupable) – One aggregate value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if the requested operator is not supported for the values dtype
Examples
>>> keys = ak.arange(0, 10)
>>> vals = ak.linspace(-1, 1, 10)
>>> g = ak.GroupBy(keys)
>>> g.aggregate(vals, 'sum')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777768, -0.55555555555555536, -0.33333333333333348, -0.11111111111111116, 0.11111111111111116, 0.33333333333333348, 0.55555555555555536, 0.77777777777777768, 1]))
>>> g.aggregate(vals, 'min')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777779, -0.55555555555555558, -0.33333333333333337, -0.11111111111111116, 0.11111111111111116, 0.33333333333333326, 0.55555555555555536, 0.77777777777777768, 1]))
- sum(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and sum each group’s values.
- Parameters:
values (pdarray) – The values to group and sum
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_sums (pdarray) – One sum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The grouped sum of a boolean pdarray returns integers.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.sum(b)
(array([2, 3, 4]), array([8, 14, 6]))
- prod(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the product of each group’s values.
- Parameters:
values (pdarray) – The values to group and multiply
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_products (pdarray, float64) – One product per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if prod is not supported for the values dtype
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.prod(b)
(array([2, 3, 4]), array([12, 108.00000000000003, 8.9999999999999982]))
- var(values: arkouda.pdarrayclass.pdarray, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the variance of each group’s values.
- Parameters:
values (pdarray) – The values to group and find variance
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The variance is the average of the squared deviations from the mean, i.e.,
var = mean((x - x.mean())**2).
The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.var(b)
(array([2 3 4]), array([2.333333333333333 1.2 0]))
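The ddof behavior described in the Notes can be checked with a small plain-Python sketch (no arkouda server required; the `var` helper below is illustrative, not part of the arkouda API):

```python
# Plain-Python sketch of the ddof formula: the sum of squared deviations
# from the mean is divided by N - ddof rather than N.
def var(x, ddof=1):
    n = len(x)
    mean = sum(x) / n
    return sum((v - mean) ** 2 for v in x) / (n - ddof)

data = [3.0, 3.0, 4.0, 1.0, 1.0]
sample_var = var(data, ddof=1)      # unbiased estimator (divisor N - 1)
population_var = var(data, ddof=0)  # maximum likelihood estimate (divisor N)
```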
- std(values: arkouda.pdarrayclass.pdarray, skipna: bool = True, ddof: arkouda.dtypes.int_scalars = 1) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the standard deviation of each group’s values.
- Parameters:
values (pdarray) – The values to group and find standard deviation
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The standard deviation is the square root of the average of the squared deviations from the mean, i.e.,
std = sqrt(mean((x - x.mean())**2)).
The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.std(b)
(array([2 3 4]), array([1.5275252316519465 1.0954451150103321 0]))
- mean(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the mean of each group’s values.
- Parameters:
values (pdarray) – The values to group and average
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_means (pdarray, float64) – One mean value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.mean(b)
(array([2, 3, 4]), array([2.6666666666666665, 2.7999999999999998, 3]))
- median(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and compute the median of each group’s values.
- Parameters:
values (pdarray) – The values to group and find median
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_medians (pdarray, float64) – One median value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,9)
>>> a
array([4 1 4 3 2 2 2 3 3])
>>> g = ak.GroupBy(a)
>>> g.keys
array([4 1 4 3 2 2 2 3 3])
>>> b = ak.linspace(-5,5,9)
>>> b
array([-5 -3.75 -2.5 -1.25 0 1.25 2.5 3.75 5])
>>> g.median(b)
(array([1 2 3 4]), array([-3.75 1.25 3.75 -3.75]))
- min(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the minimum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find minima
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_minima (pdarray) – One minimum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if min is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if min is not supported for the values dtype
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.min(b)
(array([2, 3, 4]), array([1, 1, 3]))
- max(values: arkouda.pdarrayclass.pdarray, skipna: bool = True) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the maximum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find maxima
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_maxima (pdarray) – One maximum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if max is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if max is not supported for the values dtype
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.max(b)
(array([2, 3, 4]), array([4, 4, 3]))
- argmin(values: arkouda.pdarrayclass.pdarray) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first minimum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find argmin
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if argmin is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if argmin is not supported for the values dtype
Notes
The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmin(b)
(array([2, 3, 4]), array([5, 4, 2]))
- argmax(values: arkouda.pdarrayclass.pdarray) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first maximum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find argmax
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmax(b)
(array([2, 3, 4]), array([9, 3, 2]))
- nunique(values: groupable) Tuple[groupable, arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and return the number of unique values in each group.
- Parameters:
values (pdarray, int64) – The values to group and find unique values
- Returns:
unique_keys (groupable) – The unique keys, in grouped order
group_nunique (groupable) – Number of unique values per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the dtype(s) of values array(s) does/do not support the nunique method
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if nunique is not supported for the values dtype
Examples
>>> data = ak.array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> data
array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> labels = ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> labels
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g = ak.GroupBy(labels)
>>> g.keys
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g.nunique(data)
(array([1, 2, 3, 4]), array([2, 2, 3, 1]))
# Group (1,1,1) has values [3,4,3] -> 2 unique values, 3 and 4
# Group (2,2,2) has values [1,1,4] -> 2 unique values, 1 and 4
# Group (3,3,3) has values [3,4,1] -> 3 unique values
# Group (4) has values [4] -> 1 unique value
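The per-group unique-count semantics above can be sketched in plain Python, without an arkouda server; the `nunique` helper below is illustrative, not the arkouda implementation:

```python
# Plain-Python sketch: collect each group's values into a set, then count.
from collections import defaultdict

def nunique(keys, values):
    groups = defaultdict(set)
    for k, v in zip(keys, values):
        groups[k].add(v)
    unique_keys = sorted(groups)
    return unique_keys, [len(groups[k]) for k in unique_keys]

labels = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4]
data   = [3, 4, 3, 1, 1, 4, 3, 4, 1, 4]
result = nunique(labels, data)  # ([1, 2, 3, 4], [2, 2, 3, 1])
```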
- any(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and perform an “or” reduction on each group.
- Parameters:
values (pdarray, bool) – The values to group and reduce with “or”
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_any (pdarray, bool) – One bool per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
- all(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Using the permutation stored in the GroupBy instance, group another array of values and perform an “and” reduction on each group.
- Parameters:
values (pdarray, bool) – The values to group and reduce with “and”
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_any (pdarray, bool) – One bool per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if all is not supported for the values dtype
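The grouped "or" and "and" reductions above can be sketched in plain Python (no arkouda server needed; `group_reduce` is an illustrative helper, not part of the arkouda API):

```python
# Plain-Python sketch: gather each group's bools, then apply any()/all().
from collections import defaultdict

def group_reduce(keys, values, reduction):
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    unique_keys = sorted(groups)
    return unique_keys, [reduction(groups[k]) for k in unique_keys]

keys  = [2, 3, 2, 3, 4]
flags = [False, True, True, True, False]
_, group_any = group_reduce(keys, flags, any)  # "or" reduction per group
_, group_all = group_reduce(keys, flags, all)  # "and" reduction per group
```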
- OR(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise OR of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise OR reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with OR
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise OR of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if OR is not supported for the values dtype
- AND(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise AND of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise AND reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with AND
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise AND of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if AND is not supported for the values dtype
- XOR(values: arkouda.pdarrayclass.pdarray) Tuple[arkouda.pdarrayclass.pdarray | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], arkouda.pdarrayclass.pdarray]#
Bitwise XOR of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise XOR reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with XOR
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise XOR of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if XOR is not supported for the values dtype
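The three bitwise segment reductions (OR, AND, XOR) share one shape: fold the operator over each group's integer values. A plain-Python sketch of that semantics (the `group_bitwise` helper is illustrative, not the arkouda API):

```python
# Plain-Python sketch: reduce each group's values with a bitwise operator.
from collections import defaultdict
from functools import reduce
import operator

def group_bitwise(keys, values, op):
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    unique_keys = sorted(groups)
    return unique_keys, [reduce(op, groups[k]) for k in unique_keys]

keys = [1, 1, 2, 2]
vals = [0b0101, 0b0011, 0b1100, 0b1010]
_, ors  = group_bitwise(keys, vals, operator.or_)   # [0b0111, 0b1110]
_, ands = group_bitwise(keys, vals, operator.and_)  # [0b0001, 0b1000]
_, xors = group_bitwise(keys, vals, operator.xor)   # [0b0110, 0b0110]
```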
- first(values: groupable_element_type) Tuple[groupable, groupable_element_type]#
First value in each group.
- Parameters:
values (pdarray-like) – The values from which to take the first of each group
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The first value of each group
- mode(values: groupable) Tuple[groupable, groupable]#
Most common value in each group. If a group is multi-modal, return the modal value that occurs first.
- Parameters:
values ((list of) pdarray-like) – The values from which to take the mode of each group
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result ((list of) pdarray-like) – The most common value of each group
- unique(values: groupable)#
Return the set of unique values in each group, as a SegArray.
- Parameters:
values ((list of) pdarray-like) – The values to unique
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result ((list of) SegArray) – The unique values of each group
- Raises:
TypeError – Raised if values is or contains Strings or Categorical
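The first/mode/unique accessors above all select from each group's values. Their semantics can be sketched in plain Python (no arkouda server; the helpers below are illustrative, not the arkouda API). Note that for a multi-modal group, `Counter.most_common` preserves insertion order among ties, matching the "modal value that occurs first" rule:

```python
# Plain-Python sketch of first/mode/unique per-group semantics.
from collections import Counter, defaultdict

def group_values(keys, values):
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    return {k: groups[k] for k in sorted(groups)}

keys = [1, 1, 1, 2, 2]
vals = [3, 4, 3, 5, 5]
g = group_values(keys, vals)
firsts = {k: v[0] for k, v in g.items()}                             # first value per group
modes  = {k: Counter(v).most_common(1)[0][0] for k, v in g.items()}  # most common value
uniq   = {k: sorted(set(v)) for k, v in g.items()}                   # set of unique values
```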
- broadcast(values: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings, permute: bool = True) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings#
Fill each group’s segment with a constant value.
- Parameters:
- Returns:
The broadcasted values
- Return type:
- Raises:
TypeError – Raised if value is not a pdarray object
ValueError – Raised if the values array does not have one value per segment
Notes
This function is a sparse analog of
np.broadcast. If a GroupBy object represents a sparse matrix (tensor), then this function takes a (dense) column vector and replicates each value to the non-zero elements in the corresponding row.
Examples
>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
>>> # By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
>>> # With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5])
>>> a = ak.randint(1,5,10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
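The core of broadcast is an index-lookup: each group's single value is replicated back to every element position belonging to that group. A minimal plain-Python sketch (no arkouda server; the `broadcast` helper is illustrative, not the arkouda API):

```python
# Plain-Python sketch: replicate one value per group back to element positions.
def broadcast(keys, value_by_key):
    # keys gives each element's group; look up that group's value.
    return [value_by_key[k] for k in keys]

keys = [0, 1, 0, 1, 0]
per_group = {0: 3, 1: 5}
out = broadcast(keys, per_group)  # [3, 5, 3, 5, 3]
```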
- static build_from_components(user_defined_name: str = None, **kwargs) GroupBy#
Function to build a new GroupBy object from component keys and permutation.
- Parameters:
user_defined_name (str, optional) – Passing a name will initialize the new GroupBy and assign it the given name
kwargs (dict) – Dictionary of components required for rebuilding the GroupBy. Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”
- Returns:
The GroupBy object created by using the given components
- Return type:
- register(user_defined_name: str) GroupBy#
Register this GroupBy object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user-defined name the GroupBy is to be registered under; this will be the root name for underlying components
- Returns:
The same GroupBy which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluid programming style. Please note you cannot register two different GroupBys with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the GroupBy with the user_defined_name
See also
unregister, attach, unregister_groupby_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister()#
Unregister this GroupBy object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() bool#
Return True if the object is contained in the registry
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mismatch of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- static attach(user_defined_name: str) GroupBy#
Function to return a GroupBy object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which GroupBy object was registered under
- Returns:
The GroupBy object created by re-attaching to the corresponding server components
- Return type:
- Raises:
RegistrationError – if user_defined_name is not registered
See also
register, is_registered, unregister, unregister_groupby_by_name
- static unregister_groupby_by_name(user_defined_name: str) None#
Function to unregister GroupBy object by name which was registered with the arkouda server via register()
- Parameters:
user_defined_name (str) – Name under which the GroupBy object was registered
- Raises:
TypeError – if user_defined_name is not a string
RegistrationError – if there is an issue attempting to unregister any underlying components
See also
- most_common(values)#
(Deprecated) See GroupBy.mode().
- arkouda.plot_dist(b, h, log=True, xlabel=None, newfig=True)#
Plot the distribution and cumulative distribution of histogram data.
- Parameters:
b (np.ndarray) – Bin edges
h (np.ndarray) – Histogram data
log (bool) – use log to scale y
xlabel (str) – Label for the x axis of the graph
newfig (bool) – Generate a new figure or not
Notes
This function does not return or display the plot. A user must have matplotlib imported in addition to arkouda to display plots. This could be updated to return the object or have a flag to show the resulting plots. See Examples below.
Examples
>>> import arkouda as ak
>>> from matplotlib import pyplot as plt
>>> b, h = ak.histogram(ak.arange(10), 3)
>>> ak.plot_dist(b, h.to_ndarray())
>>> # to show the plot
>>> plt.show()
- arkouda.hist_all(ak_df: arkouda.dataframe.DataFrame, cols: list = [])#
Create a grid plot histogramming all numeric columns in an Arkouda DataFrame.
- Parameters:
ak_df (ak.DataFrame) – Full Arkouda DataFrame containing data to be visualized
cols (list) – (Optional) A specified list of columns to be plotted
Notes
This function displays the plot.
Examples
>>> import numpy as np
>>> import arkouda as ak
>>> from arkouda.plotting import hist_all
>>> ak_df = ak.DataFrame({"a": ak.array(np.random.randn(100)),
...                       "b": ak.array(np.random.randn(100)),
...                       "c": ak.array(np.random.randn(100)),
...                       "d": ak.array(np.random.randn(100))})
>>> hist_all(ak_df)
- class arkouda.Categorical(values, **kwargs)#
Represents an array of values belonging to named categories. Converting a Strings object to Categorical often saves memory and speeds up operations, especially if there are many repeated values, at the cost of some one-time work in initialization.
- Parameters:
values (Strings) – String values to convert to categories
NAvalue (str scalar) – The value to use to represent missing/null data
- permutation#
The permutation that groups the values in the same order as categories
- Type:
pdarray, int64
- size#
The number of items in the array
- Type:
Union[int,np.int64]
- nlevels#
The number of distinct categories
- Type:
Union[int,np.int64]
- ndim#
The rank of the array (currently only rank 1 arrays supported)
- Type:
Union[int,np.int64]
- shape#
The sizes of each dimension of the array
- Type:
tuple
- BinOps#
- RegisterablePieces#
- RequiredPieces#
- permutation#
- segments#
- objType = 'Categorical'#
- dtype#
- classmethod from_codes(codes: arkouda.pdarrayclass.pdarray, categories: arkouda.strings.Strings, permutation=None, segments=None, **kwargs) Categorical#
Make a Categorical from codes and categories arrays. If codes and categories have already been pre-computed, this constructor saves time. If not, please use the normal constructor.
- Parameters:
- Returns:
The Categorical object created from the input parameters
- Return type:
- Raises:
TypeError – Raised if codes is not a pdarray of int64 objects or if categories is not a Strings object
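The codes/categories representation that from_codes consumes can be sketched in plain Python: element i of the Categorical is the category label indexed by codes[i]. This sketch uses plain lists rather than arkouda arrays, purely to illustrate the layout:

```python
# Plain-Python sketch of the codes/categories layout: values[i] = categories[codes[i]].
categories = ["low", "mid", "high"]   # the unique category labels
codes = [0, 2, 2, 1, 0]               # one int64 code per element
values = [categories[c] for c in codes]
# values == ["low", "high", "high", "mid", "low"]
```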
- classmethod from_return_msg(rep_msg) Categorical#
Create categorical from return message from server
Notes
This is currently only used when reading a Categorical from HDF5 files.
- classmethod standardize_categories(arrays, NAvalue='N/A')#
Standardize an array of Categoricals so that they share the same categories.
- Parameters:
arrays (sequence of Categoricals) – The Categoricals to standardize
NAvalue (str scalar) – The value to use to represent missing/null data
- Returns:
A list of the original Categoricals remapped to the shared categories.
- Return type:
List of Categoricals
- set_categories(new_categories, NAvalue=None)#
Set categories to user-defined values.
- Parameters:
new_categories (Strings) – The array of new categories to use. Must be unique.
NAvalue (str scalar) – The value to use to represent missing/null data
- Returns:
A new Categorical with the user-defined categories. Old values present in new categories will appear unchanged. Old values not present will be assigned the NA value.
- Return type:
- to_ndarray() numpy.ndarray#
Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. This conversion discards category information and produces an ndarray of strings. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A numpy ndarray of strings corresponding to the values in this array
- Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed
ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
- to_list() List#
Convert the Categorical to a list, transferring data from the arkouda server to Python. This conversion discards category information and produces a list of strings. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A list of strings corresponding to the values in this Categorical
- Return type:
list
Notes
The number of bytes in the Categorical cannot exceed
ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
- isna()#
Find where values are missing or null (as defined by self.NAvalue)
- reset_categories() Categorical#
Recompute the category labels, discarding any unused labels. This method is often useful after slicing or indexing a Categorical array, when the resulting array only contains a subset of the original categories. In this case, eliminating unused categories can speed up other operations.
- Returns:
A Categorical object generated from the current instance
- Return type:
- contains(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element contains the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that contain substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
- startswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element starts with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that start with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
- endswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element ends with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that end with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
- in1d(test: arkouda.strings.Strings | Categorical) arkouda.pdarrayclass.pdarray#
Test whether each element of the Categorical object is also present in the test Strings or Categorical object.
Returns a boolean array the same length as self that is True where an element of self is in test and False otherwise.
- Parameters:
test (Union[Strings, Categorical]) – The values against which to test each value of self.
- Returns:
The values self[in1d] are in the test Strings or Categorical object.
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if test is not a Strings or Categorical object
See also
Notes
in1d can be considered as an element-wise function version of the python keyword in, for 1-D sequences.
in1d(a, b) is logically equivalent to ak.array([item in b for item in a]), but is much faster and scales to arbitrarily large a.
Examples
>>> strings = ak.array([f'String {i}' for i in range(0,5)])
>>> cat = ak.Categorical(strings)
>>> ak.in1d(cat,strings)
array([True, True, True, True, True])
>>> strings = ak.array([f'String {i}' for i in range(5,9)])
>>> catTwo = ak.Categorical(strings)
>>> ak.in1d(cat,catTwo)
array([False, False, False, False, False])
- unique() Categorical#
- hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Compute a 128-bit hash of each element of the Categorical.
- Returns:
A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.
- Return type:
Notes
The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.
- group() arkouda.pdarrayclass.pdarray#
Return the permutation that groups the array, placing equivalent categories together. All instances of the same category are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.
- Returns:
The permutation that groups the array by value
- Return type:
Notes
This method is faster than the corresponding Strings method. If the Categorical was created from a Strings object, then this function simply returns the cached permutation. Even if the Categorical was created using from_codes(), this function will be faster than Strings.group() because it sorts dense integer values, rather than 128-bit hash values.
- argsort()#
- sort()#
- concatenate(others: Sequence[Categorical], ordered: bool = True) Categorical#
Merge this Categorical with other Categorical objects in the array, concatenating the arrays and synchronizing the categories.
- Parameters:
others (Sequence[Categorical]) – The Categorical arrays to concatenate and merge with this one
ordered (bool) – If True (default), the arrays will be appended in the order given. If False, array data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.
- Returns:
The merged Categorical object
- Return type:
- Raises:
TypeError – Raised if any of the others array objects are not Categorical objects
Notes
This operation can be expensive – slower than concatenating Strings.
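Synchronizing the categories is the expensive part: each input's integer codes must be remapped onto a merged category set. A hedged pure-Python sketch of that remapping (illustrative only; these names are not Arkouda API):

```python
def merge_categoricals(cats_a, codes_a, cats_b, codes_b):
    # Build a merged category list, then remap each side's codes onto it.
    merged = list(cats_a) + [c for c in cats_b if c not in set(cats_a)]
    index = {c: i for i, c in enumerate(merged)}
    remap_a = [index[cats_a[c]] for c in codes_a]
    remap_b = [index[cats_b[c]] for c in codes_b]
    return merged, remap_a + remap_b

merged, codes = merge_categoricals(['a', 'b'], [0, 1, 0], ['b', 'c'], [0, 1])
print(merged)  # ['a', 'b', 'c']
print(codes)   # [0, 1, 0, 1, 2]
```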
- to_hdf(prefix_path, dataset='categorical_array', mode='truncate', file_type='distribute')#
Save the Categorical to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale.
- Return type:
None
- update_hdf(prefix_path, dataset='categorical_array', repack=True)#
Overwrite the dataset with the name provided with this Categorical object. If the dataset does not exist it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
None
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the Categorical
Notes
If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added
Because HDF5 deletes do not release memory, the repack option allows for automatic creation of a file without the inaccessible data.
- to_parquet(prefix_path: str, dataset: str = 'categorical_array', mode: str = 'truncate', compression: str | None = None) str#
This functionality is currently not supported and will raise a RuntimeError; support is in development. Save the Categorical to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in HDF5 files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Categorical dataset within existing files.
compression (str (Optional)) – Default None Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4
- Return type:
String message indicating result of save operation
- Raises:
RuntimeError – Raised due to compatibility issues between Categorical and Parquet.
Notes
- The prefix_path must be visible to the arkouda server and the user must have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'.
- 'append' write mode is supported, but is not efficient.
- If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- save(prefix_path: str, dataset: str = 'categorical_array', file_format: str = 'HDF5', mode: str = 'truncate', file_type: str = 'distribute', compression: str | None = None) str#
DEPRECATED Save the Categorical object to HDF5 or Parquet. The result is a collection of HDF5/Parquet files, one file per locale of the arkouda server, where each filename starts with prefix_path and dataset. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in HDF5 files (must not already exist)
file_format (str {'HDF5' | 'Parquet'}) – The format to save the file to.
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If 'append', create a new Categorical dataset within existing files.
file_type (str ("single" | "distribute")) – Default: "distribute". When set to single, dataset is written to a single file. When distribute, dataset is written to a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
compression (str (Optional)) – {None | ‘snappy’ | ‘gzip’ | ‘brotli’ | ‘zstd’ | ‘lz4’} The compression type to use when writing. This is only supported for Parquet files and will not be used with HDF5.
- Return type:
String message indicating result of save operation
- Raises:
ValueError – Raised if the lengths of columns and values differ, or the mode is neither ‘truncate’ nor ‘append’
TypeError – Raised if prefix_path, dataset, or mode is not a str
Notes
Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter.
- register(user_defined_name: str) Categorical#
Register this Categorical object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the Categorical is to be registered under; this will be the root name for underlying components
- Returns:
The same Categorical which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluid programming style. Please note you cannot register two different Categoricals with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Categorical with the user_defined_name
See also
unregister, attach, unregister_categorical_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister() None#
Unregister this Categorical object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
register, attach, unregister_categorical_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
numpy.bool_
- Raises:
RegistrationError – Raised if there’s a server-side error or a mis-match of registered components
See also
register, attach, unregister, unregister_categorical_by_name
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- info() str#
Returns a JSON formatted string containing information about all components of self
- Parameters:
None –
- Returns:
JSON string containing information about all components of self
- Return type:
str
- pretty_print_info() None#
Prints information about all components of self in a human readable format
- Parameters:
None –
- Return type:
None
- static attach(user_defined_name: str) Categorical#
DEPRECATED Function to return a Categorical object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which Categorical object was registered under
- Returns:
The Categorical object created by re-attaching to the corresponding server components
- Return type:
- Raises:
TypeError – if user_defined_name is not a string
- static unregister_categorical_by_name(user_defined_name: str) None#
Function to unregister Categorical object by name which was registered with the arkouda server via register()
- Parameters:
user_defined_name (str) – Name under which the Categorical object was registered
- Raises:
TypeError – if user_defined_name is not a string
RegistrationError – if there is an issue attempting to unregister any underlying components
- static parse_hdf_categoricals(d: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings]) Tuple[List[str], Dict[str, Categorical]]#
This function should be used in conjunction with the load_all function which reads hdf5 files and reconstitutes Categorical objects. Categorical objects use a naming convention and HDF5 structure so they can be identified and constructed for the user.
In general, you should not call this method directly.
- Parameters:
d (Dictionary of String to either Pdarray or Strings object) –
- Returns:
2-Tuple of a list of strings containing key names which should be removed, and a dictionary mapping base name to Categorical object
- transfer(hostname: str, port: arkouda.dtypes.int_scalars)#
Sends a Categorical object to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the Categorical is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). Arkouda will open numLocales ports in succession, using ports in the range {port..(port+numLocales-1)} (e.g., running an Arkouda server of 4 nodes with port 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- class arkouda.Strings(strings_pdarray: arkouda.pdarrayclass.pdarray, bytes_size: arkouda.dtypes.int_scalars)#
Represents an array of strings whose data resides on the arkouda server. The user should not call this class directly; rather its instances are created by other arkouda functions.
- entry#
Encapsulation of a Segmented Strings array contained on the arkouda server. This is a composite of
offsets array: starting indices for each string
bytes array: raw bytes of all strings joined by nulls
- Type:
- size#
The number of strings in the array
- Type:
int_scalars
- nbytes#
The total number of bytes in all strings
- Type:
int_scalars
- ndim#
The rank of the array (currently only rank 1 arrays supported)
- Type:
int_scalars
- shape#
The sizes of each dimension of the array
- Type:
tuple
- dtype#
The dtype is ak.str
- Type:
dtype
- logger#
Used for all logging operations
- Type:
ArkoudaLogger
Notes
Strings is composed of two pdarrays: (1) offsets, which contains the starting indices for each string and (2) bytes, which contains the raw bytes of all strings, delimited by nulls.
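That two-pdarray layout can be mimicked client-side: offsets mark where each string starts in a single null-delimited byte buffer. A pure-Python sketch (illustrative helper, not Arkouda API; values match the get_bytes()/get_offsets() examples in this class):

```python
def to_segmented(strings):
    # Build (offsets, bytes) in the null-delimited layout described above.
    offsets, data = [], []
    for s in strings:
        offsets.append(len(data))       # starting index of this string
        data.extend(s.encode('utf-8'))  # raw bytes of the string
        data.append(0)                  # null delimiter between strings
    return offsets, data

offsets, data = to_segmented(['one', 'two', 'three'])
print(offsets)  # [0, 4, 8]
print(data)     # [111, 110, 101, 0, 116, 119, 111, 0, 116, 104, 114, 101, 101, 0]
```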
- BinOps#
- objType = 'Strings'#
- static from_return_msg(rep_msg: str) Strings#
Factory method for creating a Strings object from an Arkouda server response message
- Parameters:
rep_msg (str) – Server response message currently of form created name type size ndim shape itemsize+created bytes.size 1234
- Returns:
object representing a segmented strings array on the server
- Return type:
- Raises:
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
We really don’t have an itemsize because these are variable length strings. In the future we could probably use this position to store the total bytes.
- static from_parts(offset_attrib: arkouda.pdarrayclass.pdarray | str, bytes_attrib: arkouda.pdarrayclass.pdarray | str) Strings#
Factory method for creating a Strings object from an Arkouda server response where the arrays are separate components.
- Parameters:
- Returns:
object representing a segmented strings array on the server
- Return type:
- Raises:
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
This factory method is used when we construct the parts of a Strings object on the client side and transfer the offsets & bytes separately to the server. This results in two entries in the symbol table, and we need to instruct the server to assemble them into a composite entity.
- get_lengths() arkouda.pdarrayclass.pdarray#
Return the length of each string in the array.
- Returns:
The length of each string
- Return type:
pdarray, int
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- get_bytes()#
Getter for the bytes component (uint8 pdarray) of this Strings.
- Returns:
Pdarray of bytes of the string accessed
- Return type:
pdarray, uint8
Example
>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_bytes()
[111 110 101 0 116 119 111 0 116 104 114 101 101 0]
- get_offsets()#
Getter for the offsets component (int64 pdarray) of this Strings.
- Returns:
Pdarray of offsets of the string accessed
- Return type:
pdarray, int64
Example
>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_offsets()
[0 4 8]
- encode(toEncoding: str, fromEncoding: str = 'UTF-8')#
Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding
- Parameters:
toEncoding (str) – The encoding that the strings will be converted to
fromEncoding (str) – The current encoding of the strings object, default to UTF-8
- Returns:
A new Strings object in toEncoding
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- decode(fromEncoding, toEncoding='UTF-8')#
Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding
- Parameters:
fromEncoding (str) – The current encoding of the strings object
toEncoding (str) – The encoding that the strings will be converted to, default to UTF-8
- Returns:
A new Strings object in toEncoding
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- to_lower() Strings#
Returns a new Strings with all uppercase characters from the original replaced with their lowercase equivalent
- Returns:
Strings with all uppercase characters from the original replaced with their lowercase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.to_lower()
array(['strings 0', 'strings 1', 'strings 2', 'strings 3', 'strings 4'])
- to_upper() Strings#
Returns a new Strings with all lowercase characters from the original replaced with their uppercase equivalent
- Returns:
Strings with all lowercase characters from the original replaced with their uppercase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.to_upper()
array(['STRINGS 0', 'STRINGS 1', 'STRINGS 2', 'STRINGS 3', 'STRINGS 4'])
- to_title() Strings#
Returns a new Strings from the original replaced with their titlecase equivalent
- Returns:
Strings from the original replaced with their titlecase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Strings.to_lower, Strings.to_upper
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.to_title()
array(['Strings 0', 'Strings 1', 'Strings 2', 'Strings 3', 'Strings 4'])
- is_lower() arkouda.pdarrayclass.pdarray#
Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely lowercase
- Returns:
True for elements that are entirely lowercase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.is_lower()
array([True True True False False False])
- is_upper() arkouda.pdarrayclass.pdarray#
Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely uppercase
- Returns:
True for elements that are entirely uppercase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.is_upper()
array([False False False True True True])
- is_title() arkouda.pdarrayclass.pdarray#
Returns a boolean pdarray where index i indicates whether string i of the Strings is titlecase
- Returns:
True for elements that are titlecase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> mixed = ak.array([f'sTrINgs {i}' for i in range(3)])
>>> title = ak.array([f'Strings {i}' for i in range(3)])
>>> strings = ak.concatenate([mixed, title])
>>> strings
array(['sTrINgs 0', 'sTrINgs 1', 'sTrINgs 2', 'Strings 0', 'Strings 1', 'Strings 2'])
>>> strings.is_title()
array([False False False True True True])
- strip(chars: bytes | arkouda.dtypes.str_scalars | None = '') Strings#
Returns a new Strings object with all leading and trailing occurrences of characters contained in chars removed. The chars argument is a string specifying the set of characters to be removed. If omitted, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.
- Parameters:
chars – the set of characters to be removed
- Returns:
Strings object with the leading and trailing characters matching the set of characters in the chars argument removed
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['Strings ', ' StringS ', 'StringS '])
>>> s = strings.strip()
>>> s
array(['Strings', 'StringS', 'StringS'])

>>> strings = ak.array(['Strings 1', '1 StringS ', ' 1StringS 12 '])
>>> s = strings.strip(' 12')
>>> s
array(['Strings', 'StringS', 'StringS'])
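The chars-as-a-character-set behavior mirrors Python's built-in str.strip, which serves as a convenient local reference for the semantics:

```python
# str.strip treats its argument as a set of characters to remove from
# both ends, not as a prefix/suffix, just like Strings.strip above.
values = ['Strings 1', '1 StringS ', ' 1StringS 12 ']
stripped = [v.strip(' 12') for v in values]
print(stripped)  # ['Strings', 'StringS', 'StringS']
```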
- cached_regex_patterns() List#
Returns the regex patterns for which Match objects have been cached
- purge_cached_regex_patterns() None#
Purges cached regex patterns
- find_locations(pattern: bytes | arkouda.dtypes.str_scalars) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Finds pattern matches and returns pdarrays containing the number, start positions, and lengths of matches
- Parameters:
pattern (str_scalars) – The regex pattern used to find matches
- Returns:
pdarray, int64 – For each original string, the number of pattern matches
pdarray, int64 – The start positions of pattern matches
pdarray, int64 – The lengths of pattern matches
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> num_matches, starts, lens = strings.find_locations('\d')
>>> num_matches
array([2, 2, 2, 2, 2])
>>> starts
array([0, 9, 0, 9, 0, 9, 0, 9, 0, 9])
>>> lens
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
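The same three result arrays can be derived locally, element by element, with re.finditer (a client-side sketch with a hypothetical helper name; the server computes this in parallel):

```python
import re

def find_locations_sketch(strings, pattern):
    # Per-string match counts, plus flattened start positions and lengths.
    counts, starts, lens = [], [], []
    for s in strings:
        matches = list(re.finditer(pattern, s))
        counts.append(len(matches))
        for m in matches:
            starts.append(m.start())
            lens.append(m.end() - m.start())
    return counts, starts, lens

strings = [f'{i} string {i}' for i in range(1, 6)]
counts, starts, lens = find_locations_sketch(strings, r'\d')
print(counts)  # [2, 2, 2, 2, 2]
print(starts)  # [0, 9, 0, 9, 0, 9, 0, 9, 0, 9]
print(lens)    # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```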
- search(pattern: bytes | arkouda.dtypes.str_scalars) arkouda.match.Match#
Returns a match object with the first location in each element where pattern produces a match. Elements match if any part of the string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match if any part of the string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+')
<ak.Match object: matched=True, span=(1, 2); matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
- match(pattern: bytes | arkouda.dtypes.str_scalars) arkouda.match.Match#
Returns a match object where elements match only if the beginning of the string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match only if the beginning of the string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.match('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
- fullmatch(pattern: bytes | arkouda.dtypes.str_scalars) arkouda.match.Match#
Returns a match object where elements match only if the whole string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match only if the whole string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.fullmatch('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=False; matched=False>
- split(pattern: bytes | arkouda.dtypes.str_scalars, maxsplit: int = 0, return_segments: bool = False) Strings | Tuple#
Returns a new Strings split by the occurrences of pattern. If maxsplit is nonzero, at most maxsplit splits occur
- Parameters:
pattern (str) – Regex used to split strings into substrings
maxsplit (int) – The max number of pattern match occurrences in each element to split. The default maxsplit=0 splits on all occurrences
return_segments (bool) – If True, return mapping of original strings to first substring in return array.
- Returns:
Strings – Substrings with pattern matches removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.split('_+', maxsplit=2, return_segments=True)
(array(['1', '2', '', '', '', '3', '', '4', '5____6___7', '']), array([0 3 5 6 9]))
- findall(pattern: bytes | arkouda.dtypes.str_scalars, return_match_origins: bool = False) Strings | Tuple#
Return a new Strings containing all non-overlapping matches of pattern
- Parameters:
pattern (str_scalars) – Regex used to find matches
return_match_origins (bool) – If True, return a pdarray containing the index of the original string each pattern match is from
- Returns:
Strings – Strings object containing only pattern matches
pdarray, int64 (optional) – The index of the original string each pattern match is from
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.findall('_+', return_match_origins=True)
(array(['_', '___', '____', '__', '___', '____', '___']), array([0 0 1 3 3 3 3]))
- sub(pattern: bytes | arkouda.dtypes.str_scalars, repl: bytes | arkouda.dtypes.str_scalars, count: int = 0) Strings#
Return new Strings obtained by replacing non-overlapping occurrences of pattern with the replacement repl. If count is nonzero, at most count substitutions occur
- Parameters:
pattern (str_scalars) – The regex to substitute
repl (str_scalars) – The substring to replace pattern matches with
count (int) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Returns:
Strings with pattern matches replaced
- Return type:
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.sub(pattern='_+', repl='-', count=2)
array(['1-2-', '-', '3', '-4-5____6___7', ''])
- subn(pattern: bytes | arkouda.dtypes.str_scalars, repl: bytes | arkouda.dtypes.str_scalars, count: int = 0) Tuple#
Perform the same operation as sub(), but return a tuple (new_Strings, number_of_substitutions)
- Parameters:
pattern (str_scalars) – The regex to substitute
repl (str_scalars) – The substring to replace pattern matches with
count (int) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Returns:
Strings – Strings with pattern matches replaced
pdarray, int64 – The number of substitutions made for each element of Strings
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.subn(pattern='_+', repl='-', count=2)
(array(['1-2-', '-', '3', '-4-5____6___7', '']), array([2 1 0 2 0]))
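The per-element behavior matches Python's re.subn applied string by string, which makes a handy local cross-check (client-side sketch, not how the server computes it):

```python
import re

# re.subn returns (new_string, number_of_substitutions) per element.
strings = ['1_2___', '____', '3', '__4___5____6___7', '']
pairs = [re.subn('_+', '-', s, count=2) for s in strings]
print([p[0] for p in pairs])  # ['1-2-', '-', '3', '-4-5____6___7', '']
print([p[1] for p in pairs])  # [2, 1, 0, 2, 0]
```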
- contains(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element contains the given substring.
- Parameters:
substr (str_scalars) – The substring in the form of string or byte array to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that contain substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> strings
array(['1 string 1', '2 string 2', '3 string 3', '4 string 4', '5 string 5'])
>>> strings.contains('string')
array([True, True, True, True, True])
>>> strings.contains('string \d', regex=True)
array([True, True, True, True, True])
- startswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element starts with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The prefix to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that start with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.startswith('string')
array([True, True, True, True, True])
>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.startswith('\d str', regex = True)
array([True, True, True, True, True])
- endswith(substr: bytes | arkouda.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray#
Check whether each element ends with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The suffix to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that end with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.endswith('ing')
array([True, True, True, True, True])
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.endswith('ing \d', regex = True)
array([True, True, True, True, True])
- flatten(delimiter: str, return_segments: bool = False, regex: bool = False) Strings | Tuple#
Unpack delimiter-joined substrings into a flat array.
- Parameters:
delimiter (str) – Characters used to split strings into substrings
return_segments (bool) – If True, also return mapping of original strings to first substring in return array.
regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
Strings – Flattened substrings with delimiters removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array
Examples
>>> orig = ak.array(['one|two', 'three|four|five', 'six'])
>>> orig.flatten('|')
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> flat, map = orig.flatten('|', return_segments=True)
>>> map
array([0, 2, 5])
>>> under = ak.array(['one_two', 'three_____four____five', 'six'])
>>> under_flat, under_map = under.flatten('_+', return_segments=True, regex=True)
>>> under_flat
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> under_map
array([0, 2, 5])
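A client-side sketch of flatten with return_segments, using re.split (the helper name is hypothetical, not the Arkouda implementation):

```python
import re

def flatten_sketch(strings, delimiter, regex=False):
    # Split each string, recording where its first substring lands in
    # the flattened output (the optional segments mapping).
    pattern = delimiter if regex else re.escape(delimiter)
    flat, segments = [], []
    for s in strings:
        segments.append(len(flat))
        flat.extend(re.split(pattern, s))
    return flat, segments

flat, segments = flatten_sketch(['one|two', 'three|four|five', 'six'], '|')
print(flat)      # ['one', 'two', 'three', 'four', 'five', 'six']
print(segments)  # [0, 2, 5]
```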
- peel(delimiter: bytes | arkouda.dtypes.str_scalars, times: arkouda.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, fromRight: bool = False, regex: bool = False) Tuple#
Peel off one or more delimited fields from each string (similar to string.partition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the first (times-1) delimiters
includeDelimiter (bool) – If true, append the delimiter to the end of the first return array. By default, it is prepended to the beginning of the second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the first array. By default, such strings are returned in the second array.
fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)
regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
- left: Strings
The field(s) peeled from the end of each string (unless fromRight is true)
- right: Strings
The remainder of each string after peeling (unless fromRight is true)
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not byte or str_scalars, if times is not int64, or if includeDelimiter, keepPartial, or fromRight is not bool
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
>>> s.peel('.', includeDelimiter=True)
(array(['a.', 'c.', 'e.']), array(['b', 'd', 'f.g']))
>>> s.peel('.', times=2)
(array(['', '', 'e.f']), array(['a.b', 'c.d', 'g']))
>>> s.peel('.', times=2, keepPartial=True)
(array(['a.b', 'c.d', 'e.f']), array(['', '', 'g']))
- rpeel(delimiter: bytes | arkouda.dtypes.str_scalars, times: arkouda.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, regex: bool = False)#
Peel off one or more delimited fields from the end of each string (similar to string.rpartition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the last (times-1) delimiters
includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return array. By default, it is appended to the end of the second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the second array. By default, such strings are returned in the first array.
regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
- left: Strings
The remainder of the string after peeling
- right: Strings
The field(s) that were peeled from the right of each string
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if times is not int64
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.rpeel('.')
(array(['a', 'c', 'e.f']), array(['b', 'd', 'g']))
>>> # Compared against peel
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
- stick(other: Strings, delimiter: bytes | arkouda.dtypes.str_scalars = '', toLeft: bool = False) Strings#
Join the strings from another array onto one end of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (str) – String inserted between self and other
toLeft (bool) – If true, join other strings to the left of self. By default, other is joined to the right of self.
- Returns:
The array of joined strings
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if the other parameter is not a Strings instance
ValueError – Raised if times is < 1
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.stick(t, delimiter='.')
array(['a.b', 'c.d', 'e.f'])
- lstick(other: Strings, delimiter: bytes | arkouda.dtypes.str_scalars = '') Strings#
Join the strings from another array onto the left of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (Union[bytes,str_scalars]) – String inserted between self and other
- Returns:
The array of joined strings, as other + self
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is neither bytes nor a str or if the other parameter is not a Strings instance
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.lstick(t, delimiter='.')
array(['b.a', 'd.c', 'f.e'])
- get_prefixes(n: arkouda.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray]#
Return the n-long prefix of each string, where possible
- Parameters:
n (int) – Length of prefix
return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-prefix
proper (bool) – If True, only return proper prefixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a prefix.
- Returns:
prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character prefix, False otherwise.
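To make the documented semantics concrete, here is a plain-Python sketch of what get_prefixes returns (an illustration only; the real method runs server-side on a Strings object, and the helper name is hypothetical):

```python
def get_prefixes_sketch(strings, n, proper=True):
    """Mimic Strings.get_prefixes: return (prefixes, origin_mask)."""
    # With proper=True a string must be at least n+1 characters long
    # to contribute a prefix; with proper=False length n is enough.
    min_len = n + 1 if proper else n
    mask = [len(s) >= min_len for s in strings]
    prefixes = [s[:n] for s, keep in zip(strings, mask) if keep]
    return prefixes, mask

# 'ab' (length 2) is excluded when proper=True since it is not longer than n=2.
prefixes, mask = get_prefixes_sketch(['abcd', 'ab', 'abc'], 2)
```

Note that the number of prefixes equals the number of True values in the mask, matching the description above.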
- get_suffixes(n: arkouda.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray]#
Return the n-long suffix of each string, where possible
- Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-suffix
proper (bool) – If True, only return proper suffixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a suffix.
- Returns:
suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character suffix, False otherwise.
- hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]#
Compute a 128-bit hash of each string.
- Returns:
A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.
- Return type:
Notes
The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.
- group() arkouda.pdarrayclass.pdarray#
Return the permutation that groups the array, placing equivalent strings together. All instances of the same string are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.
- Returns:
The permutation that groups the array by value
- Return type:
Notes
If the arkouda server is compiled with “-sSegmentedString.useHash=true”, then arkouda uses 128-bit hash values to group strings, rather than sorting the strings directly. This method is fast, but the resulting permutation merely groups equivalent strings and does not sort them. If the “useHash” parameter is false, then a full sort is performed.
- Raises:
RuntimeError – Raised if there is a server-side error in executing group request or creating the pdarray encapsulating the return message
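The grouping-without-sorting guarantee above can be illustrated client-side in plain Python (a hypothetical sketch, not the server implementation): ordering by hash places equal values in contiguous blocks even though the blocks themselves are in no particular order.

```python
# Group by hash value, as the server may when useHash is enabled.
vals = ['b', 'a', 'b', 'c', 'a']
perm = sorted(range(len(vals)), key=lambda i: hash(vals[i]))
grouped = [vals[i] for i in perm]

def is_grouped(seq):
    """True iff every distinct value occupies one contiguous block."""
    seen = set()
    prev = object()  # sentinel that compares unequal to everything
    for x in seq:
        if x != prev:
            if x in seen:
                return False  # value reappeared after its block ended
            seen.add(x)
            prev = x
    return True
```

Applying `perm` yields a sequence where `is_grouped` holds, but `grouped` need not be sorted.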
- to_ndarray() numpy.ndarray#
Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same strings as this array
- Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
numpy.ndarray
- to_list() list#
Convert the SegString to a list, transferring data from the arkouda server to Python. If the SegString exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A list with the same strings as this SegString
- Return type:
list
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
See also
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_list()
['hello', 'my', 'world']
>>> type(a.to_list())
list
- astype(dtype) arkouda.pdarrayclass.pdarray#
Cast values of Strings object to provided dtype
- Parameters:
dtype (np.dtype or str) – Dtype to cast to
- Returns:
An arkouda pdarray with values converted to the specified data type
- Return type:
ak.pdarray
Notes
This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.
- to_parquet(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', compression: str | None = None) str#
Save the Strings object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
- Parameters:
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'.
'append' write mode is supported, but is not efficient.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
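The per-locale output naming scheme described in the notes can be sketched in plain Python (the prefix and locale count below are hypothetical examples, not arkouda defaults):

```python
def output_filenames(prefix_path, num_locales):
    """Filenames produced for file_type='distribute': one per locale."""
    return [f"{prefix_path}_LOCALE{i}" for i in range(num_locales)]

# A 4-locale server writing with prefix 'data/strings' produces
# data/strings_LOCALE0 through data/strings_LOCALE3.
names = output_filenames('data/strings', 4)
```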
- to_hdf(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, file_type: str = 'distribute') str#
Save the Strings object to HDF5. The object can be saved to a collection of files or single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – The name of the Strings dataset to be written, defaults to strings_array
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read.
file_type (str ("single" | "distribute")) – Default: 'distribute'. 'distribute' writes the dataset to one file per locale; 'single' saves the dataset to a single file.
- Return type:
String message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
Parquet files do not store the segments, only the values.
Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string
the hdf5 group is named via the dataset parameter.
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'. Otherwise, the file name will be prefix_path.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
See also
- update_hdf(prefix_path: str, dataset: str = 'strings_array', save_offsets: bool = True, repack: bool = True)#
Overwrite the dataset with the name provided with this Strings object. If the dataset does not exist, it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read.
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the Strings object
Notes
If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added
- to_csv(prefix_path: str, dataset: str = 'strings_array', col_delim: str = ',', overwrite: bool = False)#
Write Strings to CSV file(s). File will contain a single column with the Strings data. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).
- Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
dataset (str) – Column name to save the Strings under. Defaults to “strings_array”.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
str response message
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
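The delimiter-safety warning above can be checked client-side before writing; this is a hypothetical helper, not part of the arkouda API:

```python
def delimiter_is_safe(values, col_delim=','):
    """True iff the chosen column delimiter appears in none of the values."""
    return all(col_delim not in v for v in values)

# 'b;c' contains ';' but not ',', so ',' is a safe delimiter here.
ok = delimiter_is_safe(['a', 'b;c'], ',')
```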
- save(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str#
DEPRECATED Save the Strings object to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 supports single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – The name of the Strings dataset to be written, defaults to strings_array
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If 'append', create a new Strings dataset within existing files.
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read. This is not supported for Parquet files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
file_format (str) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.
file_type (str ("single" | "distribute")) – Default: 'distribute'. 'distribute' writes the dataset to one file per locale; 'single' saves the dataset to a single file.
- Return type:
String message indicating result of save operation
Notes
Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter. (3) Parquet files do not store the segments, only the values.
- info() str#
Returns a JSON formatted string containing information about all components of self
- Parameters:
None –
- Returns:
JSON string containing information about all components of self
- Return type:
str
- pretty_print_info() None#
Prints information about all components of self in a human readable format
- Parameters:
None –
- Return type:
None
- register(user_defined_name: str) Strings#
Register this Strings object with a user defined name in the arkouda server so it can be attached to later using Strings.attach(). This is an in-place operation; registering a Strings object more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one object at a time.
- Parameters:
user_defined_name (str) – user defined name which the Strings object is to be registered under
- Returns:
The same Strings object which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different objects with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Strings object with the user_defined_name If the user is attempting to register more than one object with the same name, the former should be unregistered first to free up the registration name.
See also
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- unregister() None#
Unregister a Strings object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not find the internal name/symbol to remove
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry
- Parameters:
None –
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
- static attach(user_defined_name: str) Strings#
Static method to return a Strings object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which the Strings object was registered under
- Returns:
the Strings object registered with user_defined_name in the arkouda server
- Return type:
Strings object
- Raises:
TypeError – Raised if user_defined_name is not a str
See also
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- static unregister_strings_by_name(user_defined_name: str) None#
Unregister a Strings object in the arkouda server previously registered via register()
- Parameters:
user_defined_name (str) – The registered name of the Strings object
See also
- transfer(hostname: str, port: arkouda.dtypes.int_scalars)#
Sends a Strings object to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the Strings object is running.
port (int_scalars) – The port over which to send the array. This needs to be an open port (i.e., not one that the Arkouda server is running on) and will open up numLocales ports in succession, so ports in the range {port..(port+numLocales)} will be used (e.g., when running an Arkouda server of 4 nodes and port 1234 is passed, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
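The port-range behavior described for transfer can be sketched as a small client-side calculation (hypothetical helper, not arkouda code):

```python
def transfer_ports(port, num_locales):
    """Ports a transfer would occupy: port, port+1, ..., port+num_locales-1."""
    return list(range(port, port + num_locales))

# Matches the documented example: 4 locales starting at port 1234.
ports = transfer_ports(1234, 4)
```

This makes it easy to verify ahead of time that the whole range is free before calling transfer.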
- class arkouda.Datetime(pda, unit: str = _BASE_UNIT)#
Bases:
_AbstractBaseTime
Represents a date and/or time.
Datetime is the Arkouda analog to pandas DatetimeIndex and other timeseries data types.
- Parameters:
pda (int64 pdarray, pd.DatetimeIndex, pd.Series, or np.datetime64 array) –
unit (str, default 'ns') –
For int64 pdarray, denotes the unit of the input. Ignored for pandas and numpy arrays, which carry their own unit. Not case-sensitive; prefixes of full names (like ‘sec’) are accepted.
Possible values:
’weeks’ or ‘w’
’days’ or ‘d’
’hours’ or ‘h’
’minutes’, ‘m’, or ‘t’
’seconds’ or ‘s’
’milliseconds’, ‘ms’, or ‘l’
’microseconds’, ‘us’, or ‘u’
’nanoseconds’, ‘ns’, or ‘n’
Unlike in pandas, units cannot be combined or mixed with integers
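Since .values is always int64 nanoseconds (see Notes below), the unit aliases above amount to fixed conversion factors. A client-side sketch (factors only; the real parsing also accepts prefixes of full names like 'sec'):

```python
# Nanoseconds per unit, for the short aliases documented above.
NS_PER_UNIT = {
    'w': 7 * 24 * 3600 * 10**9,
    'd': 24 * 3600 * 10**9,
    'h': 3600 * 10**9,
    'm': 60 * 10**9,
    's': 10**9,
    'ms': 10**6,
    'us': 10**3,
    'ns': 1,
}

def to_nanoseconds(value, unit):
    """Convert an integer value in the given unit to nanoseconds."""
    return value * NS_PER_UNIT[unit]
```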
Notes
The .values attribute is always in nanoseconds with int64 dtype.
- property nanosecond#
- property microsecond#
- property millisecond#
- property second#
- property minute#
- property hour#
- property day#
- property month#
- property year#
- property day_of_year#
- property dayofyear#
- property day_of_week#
- property dayofweek#
- property weekday#
- property week#
- property weekofyear#
- property date#
- property is_leap_year#
- supported_with_datetime#
- supported_with_r_datetime#
- supported_with_timedelta#
- supported_with_r_timedelta#
- supported_opeq#
- supported_with_pdarray#
- supported_with_r_pdarray#
- special_objType = 'Datetime'#
- isocalendar()#
- to_pandas()#
Convert array to a pandas DatetimeIndex. Note: if the array size exceeds client.maxTransferBytes, a RuntimeError is raised.
See also
to_ndarray
- sum()#
Return the sum of all elements in the array.
- register(user_defined_name)#
Register this Datetime object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the Datetime is to be registered under, this will be the root name for underlying components
- Returns:
The same Datetime which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Datetimes with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Datetimes with the user_defined_name
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister()#
Unregister this Datetime object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- is_registered() numpy.bool_#
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
numpy.bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mis-match of registered components
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- class arkouda.CachedAccessor(name: str, accessor)#
Custom property-like object. A descriptor for caching accessors.
- Parameters:
name (str) – Namespace under which the accessor will be accessed, e.g. df.foo.
accessor (cls) – Class with the extension methods.
Notes
For accessor, the class's __init__ method assumes that one of Series, DataFrame or Index is the single argument data.
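The caching-descriptor pattern described above can be sketched in a few lines of plain Python (all names below are hypothetical illustrations, not arkouda's implementation):

```python
class CachedAccessorSketch:
    """Descriptor that builds an accessor once per instance, then caches it."""
    def __init__(self, name, accessor):
        self._name = name
        self._accessor = accessor

    def __get__(self, obj, cls):
        if obj is None:
            return self._accessor  # accessed on the class itself
        accessor_obj = self._accessor(obj)  # accessor receives the data object
        # Cache on the instance: a non-data descriptor is shadowed by the
        # instance attribute, so __get__ only runs once per object.
        object.__setattr__(obj, self._name, accessor_obj)
        return accessor_obj

class StrAccessor:
    def __init__(self, data):
        self.data = data
    def upper(self):
        return [s.upper() for s in self.data.values]

class Series:
    str = CachedAccessorSketch('str', StrAccessor)
    def __init__(self, values):
        self.values = values

s = Series(['a', 'b'])
result = s.str.upper()  # builds and caches the accessor on first access
```

Because the descriptor defines only __get__, the cached instance attribute takes precedence on every subsequent lookup.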
- arkouda.string_operators(cls)#
- arkouda.date_operators(cls)#
- class arkouda.Properties#
- class arkouda.DatetimeAccessor(series)#
Bases:
Properties
- class arkouda.StringAccessor(series)#
Bases:
Properties
- arkouda.get_filetype(filenames: str | List[str]) str#
Get the type of a file accessible to the server. Supported file types and possible return strings are 'HDF5', 'Parquet', and 'CSV'.
- Parameters:
filenames (Union[str, List[str]]) – A file or list of files visible to the arkouda server
- Returns:
Type of the file returned as a string, either 'HDF5', 'Parquet', or 'CSV'
- Return type:
str
- Raises:
ValueError – Raised if filename is empty or contains only whitespace
Notes
When list provided, it is assumed that all files are the same type
CSV Files without the Arkouda Header are not supported
See also
- arkouda.ls(filename: str, col_delim: str = ',', read_nested: bool = True) List[str]#
This function calls the h5ls utility on a HDF5 file visible to the arkouda server or calls a function that imitates the result of h5ls on a Parquet file.
- Parameters:
filename (str) – The name of the file to pass to the server
col_delim (str) – The delimiter used to separate columns if the file is a csv
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Only used for Parquet files.
- Returns:
The string output of the datasets from the server
- Return type:
str
- Raises:
TypeError – Raised if filename is not a str
ValueError – Raised if filename is empty or contains only whitespace
RuntimeError – Raised if error occurs in executing ls on an HDF5 file
Notes
This will need to be updated because Parquet will not technically support this when we update. Similar functionality will be added for Parquet in the future.
For CSV files without headers, please use ls_csv.
See also
- arkouda.ls_csv(filename: str, col_delim: str = ',') List[str]#
Used for identifying the datasets within a file when a CSV does not have a header.
- Parameters:
filename (str) – The name of the file to pass to the server
col_delim (str) – The delimiter used to separate columns if the file is a csv
- Returns:
The string output of the datasets from the server
- Return type:
str
See also
- arkouda.get_null_indices(filenames: str | List[str], datasets: str | List[str] | None = None) arkouda.pdarrayclass.pdarray | Mapping[str, arkouda.pdarrayclass.pdarray]#
Get null indices of a string column in a Parquet file.
- Parameters:
filenames (list or str) – Either a list of filenames or shell expression
datasets (list or str or None) – (List of) name(s) of dataset(s) to read. Each dataset must be a string column. There is no default value for this function, the datasets to be read must be specified.
- Returns:
For a single dataset returns an Arkouda pdarray and for multiple datasets
returns a dictionary of Arkouda pdarrays – Dictionary of {datasetName: pdarray}
- Raises:
RuntimeError – Raised if one or more of the specified files cannot be opened.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
See also
- arkouda.get_datasets(filenames: str | List[str], allow_errors: bool = False, column_delim: str = ',', read_nested: bool = True) List[str]#
Get the names of the datasets in the provided files
- Parameters:
filenames (str or List[str]) – Name of the file/s from which to return datasets
allow_errors (bool) – Default: False Whether or not to allow errors while accessing datasets
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Only used for Parquet Files.
- Return type:
List[str] of names of the datasets
- Raises:
RuntimeError –
If no datasets are returned
Notes
This function currently supports HDF5 and Parquet formats. Future updates to Parquet will deprecate this functionality on that format, but similar support will be added for Parquet at that time.
If a list of files is provided, only the datasets in the first file will be returned.
See also
- arkouda.get_columns(filenames: str | List[str], col_delim: str = ',', allow_errors: bool = False) List[str]#
Get a list of column names from CSV file(s).
- arkouda.read_hdf(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strict_types: bool = True, allow_errors: bool = False, calc_string_offsets: bool = False, tag_data=False) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index | Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index]#
Read Arkouda objects from HDF5 file/s
- Parameters:
filenames (str, List[str]) – Filename/s to read objects from
datasets (Optional str, List[str]) – datasets to read from the provided files
iterative (bool) – Iterative (True) or Single (False) function call(s) to server
strict_types (bool) – If True (default), require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.
calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the offsets/segments array on the server versus loading them from HDF5 files. In the future this option may be set to True as the default.
tag_data (bool) – Default False, if True tag the data with the code associated with the filename that the data was pulled from.
- Returns:
For a single dataset returns an Arkouda pdarray, Arkouda Strings, Arkouda SegArray, or Arkouda ArrayView. For multiple datasets returns a dictionary of {datasetName: pdarray, Strings, SegArray, or ArrayView}.
- Raises:
ValueError – Raised if all datasets are not present in all hdf5 files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
If filenames is a string, it is interpreted as a shell expression (a single filename is a valid expression, so it will work) and is expanded with glob to read all matching files.
If iterative == True each dataset name and file names are passed to the server as independent sequential strings while if iterative == False all dataset names and file names are passed to the server in a single string.
If datasets is None, infer the names of datasets from the first file and read all of them. Use
get_datasets to show the names of datasets in HDF5 files.
See also
Examples
>>> # Read with file Extension
>>> x = ak.read_hdf('path/name_prefix.h5')  # load HDF5
>>> # Read Glob Expression
>>> x = ak.read_hdf('path/name_prefix*')  # Reads HDF5
- arkouda.read_parquet(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strict_types: bool = True, allow_errors: bool = False, tag_data: bool = False, read_nested: bool = True) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index | Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index]#
Read Arkouda objects from Parquet file(s)
- Parameters:
filenames (str, List[str]) – Filename/s to read objects from
datasets (Optional str, List[str]) – datasets to read from the provided files
iterative (bool) – Iterative (True) or Single (False) function call(s) to server
strict_types (bool) – If True (default), require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.
tag_data (bool) – Default False, if True tag the data with the code associated with the filename that the data was pulled from.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. If datasets is not None, this will be ignored.
- Returns:
For a single dataset returns an Arkouda pdarray, Arkouda Strings, or Arkouda ArrayView object. For multiple datasets returns a dictionary of {datasetName: pdarray or Strings}.
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
If filenames is a string, it is interpreted as a shell expression (a single filename is a valid expression, so it will work) and is expanded with glob to read all matching files.
If iterative == True each dataset name and file names are passed to the server as independent sequential strings while if iterative == False all dataset names and file names are passed to the server in a single string.
If datasets is None, infer the names of datasets from the first file and read all of them. Use
get_datasets to show the names of datasets in Parquet files.
Parquet always recomputes offsets at this time. This will need to be updated once the Parquet workflow is updated.
See also
Examples
>>> # Read without file Extension
>>> x = ak.read_parquet('path/name_prefix.parquet')  # load Parquet
>>> # Read Glob Expression
>>> x = ak.read_parquet('path/name_prefix*')  # Reads Parquet
- arkouda.read_csv(filenames: str | List[str], datasets: str | List[str] | None = None, column_delim: str = ',', allow_errors: bool = False) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index | Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index]#
Read CSV file(s) into Arkouda objects. If more than one dataset is found, the objects will be returned in a dictionary mapping the dataset name to the Arkouda object containing the data. If the file contains the appropriately formatted header, typed data will be returned. Otherwise, all data will be returned as a Strings object.
- Parameters:
filenames (str or List[str]) – The filenames to read data from
datasets (str or List[str] (Optional)) – names of the datasets to read. When None, all datasets will be read.
column_delim (str) – The delimiter for column names and data. Defaults to “,”.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.
- Returns:
pdarray, Strings, or Mapping of {dset_name: obj} where obj is a pdarray or Strings
- Return type:
pdarray, Strings, or Mapping
- Raises:
ValueError – Raised if all datasets are not present in all CSV files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
See also
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
Unlike other file formats, CSV files store Strings in their UTF-8 format instead of storing bytes as uint8.
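No Examples section is given for read_csv; the following minimal sketch assumes a running Arkouda server and illustrative file paths (the column name 'col_name' is hypothetical):

```python
import arkouda as ak

ak.connect()  # assumes an Arkouda server is running on the default host/port

# Read every dataset from CSV files matching a glob expression
data = ak.read_csv('path/name_prefix*')

# Read a single hypothetical column, using a non-default delimiter
col = ak.read_csv('path/name_prefix*', datasets=['col_name'], column_delim='|')
```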
- arkouda.read(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strictTypes: bool = True, allow_errors: bool = False, calc_string_offsets=False, column_delim: str = ',', read_nested: bool = True) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index | Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index]#
Read datasets from files. File Type is determined automatically.
- Parameters:
filenames (list or str) – Either a list of filenames or shell expression
datasets (list or str or None) – (List of) name(s) of dataset(s) to read (default: all available)
iterative (bool) – Iterative (True) or Single (False) function call(s) to server
strictTypes (bool) – If True (default), require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.
calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the offsets/segments array on the server versus loading them from HDF5 files. In the future this option may be set to True as the default.
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Ignored if datasets is not None. Parquet files only.
- Returns:
For a single dataset returns an Arkouda pdarray, Arkouda Strings, Arkouda SegArray, or Arkouda ArrayView. For multiple datasets returns a dictionary of {datasetName: pdarray, Strings, SegArray, or ArrayView}.
- Raises:
RuntimeError – If invalid filetype is detected
See also
Notes
If filenames is a string, it is interpreted as a shell expression (a single filename is a valid expression, so it will work) and is expanded with glob to read all matching files.
If iterative == True each dataset name and file names are passed to the server as independent sequential strings while if iterative == False all dataset names and file names are passed to the server in a single string.
If datasets is None, infer the names of datasets from the first file and read all of them. Use
get_datasets to show the names of datasets in HDF5/Parquet files.
CSV files without the Arkouda Header are not supported.
Examples
>>> # Read with file Extension
>>> x = ak.read('path/name_prefix.h5')  # load HDF5 - processing determines file type, not the extension
>>> # Read without file Extension
>>> x = ak.read('path/name_prefix.parquet')  # load Parquet
>>> # Read Glob Expression
>>> x = ak.read('path/name_prefix*')  # Reads HDF5
- arkouda.read_tagged_data(filenames: str | List[str], datasets: str | List[str] | None = None, strictTypes: bool = True, allow_errors: bool = False, calc_string_offsets=False, read_nested: bool = True)#
Read datasets from files and tag each record to the file it was read from. File Type is determined automatically.
- Parameters:
filenames (list or str) – Either a list of filenames or shell expression
datasets (list or str or None) – (List of) name(s) of dataset(s) to read (default: all available)
strictTypes (bool) – If True (default), require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.
calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the offsets/segments array on the server versus loading them from HDF5 files. In the future this option may be set to True as the default.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Ignored if datasets is not None. Parquet files only.
Notes
Not currently supported for Categorical or GroupBy datasets
Examples
Read files and return the data with tagging. cat.codes of the returned Categorical will link the codes in data to the filename; data will contain the codes under Filename_Codes.
>>> data, cat = ak.read_tagged_data('path/name')
>>> data
{'Filename_Codes': array([0 3 6 9 12]), 'col_name': array([0 0 0 1])}
- arkouda.import_data(read_path: str, write_file: str = None, return_obj: bool = True, index: bool = False)#
Import data from a file saved by Pandas (HDF5/Parquet) to Arkouda object and/or a file formatted to be read by Arkouda.
- Parameters:
read_path (str) – path to file where pandas data is stored. This can be glob expression for parquet formats.
write_file (str, optional) – path to file to write arkouda formatted data to. Only write file if provided
return_obj (bool, optional) – Default True. When True return the Arkouda DataFrame object, otherwise return None
index (bool, optional) – Default False. When True, maintain the indexes loaded from the pandas file
- Raises:
RuntimeWarning –
Export attempted on Parquet file. Arkouda formatted Parquet files are readable by pandas.
RuntimeError –
Unsupported file type
- Returns:
When return_obj=True
- Return type:
pd.DataFrame
See also
pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.DataFrame.read_parquet, pandas.DataFrame.read_hdf, ak.export
Notes
Import can only be performed from hdf5 or parquet files written by pandas.
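A minimal sketch of the import workflow, assuming a running Arkouda server and pandas with a Parquet engine installed; file names are illustrative:

```python
import arkouda as ak
import pandas as pd

ak.connect()  # assumes a running Arkouda server

# Write a small pandas DataFrame to Parquet, then import it into Arkouda
pd.DataFrame({'a': [1, 2, 3]}).to_parquet('pandas_data.parquet')

# Returns an Arkouda DataFrame; also writes an Arkouda-formatted copy
ak_df = ak.import_data('pandas_data.parquet', write_file='ak_data.parquet')
```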
- arkouda.export(read_path: str, dataset_name: str = 'ak_data', write_file: str = None, return_obj: bool = True, index: bool = False)#
Export data from Arkouda file (Parquet/HDF5) to Pandas object or file formatted to be readable by Pandas
- Parameters:
read_path (str) – path to file where arkouda data is stored.
dataset_name (str) – name to store dataset under
index (bool) – Default False. When True, maintain the indexes loaded from the pandas file
write_file (str, optional) – path to file to write pandas formatted data to. Only write the file if this is set
return_obj (bool, optional) – Default True. When True return the Pandas DataFrame object, otherwise return None
- Raises:
RuntimeError –
Unsupported file type
- Returns:
When return_obj=True
- Return type:
pd.DataFrame
See also
pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.DataFrame.read_parquet, pandas.DataFrame.read_hdf, ak.import_data
Notes
If an Arkouda file is exported for pandas, the format will not change. This means Parquet files will remain Parquet and HDF5 will remain HDF5.
Export can only be performed from hdf5 or parquet files written by Arkouda. The result will be the same file type, but formatted to be read by Pandas.
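A minimal sketch of the export workflow, assuming a running Arkouda server; paths are illustrative:

```python
import arkouda as ak

ak.connect()  # assumes a running Arkouda server

# Save an Arkouda array to HDF5, then export a pandas-readable copy
ak.to_hdf({'ak_data': ak.arange(10)}, 'path/ak_file')
pd_df = ak.export('path/ak_file', dataset_name='ak_data',
                  write_file='path/pd_file.h5')
```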
- arkouda.to_hdf(columns: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView] | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView], prefix_path: str, names: List[str] = None, mode: str = 'truncate', file_type: str = 'distribute') None#
Save multiple named pdarrays to HDF5 files.
- Parameters:
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
mode ({'truncate' | 'append'}) – By default, truncate (overwrite) the output files if they exist. If ‘append’, attempt to create new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: 'distribute'. 'single' writes the dataset to a single file; 'distribute' writes the dataset to a file per locale.
- Return type:
None
- Raises:
ValueError – Raised if (1) the lengths of columns and values differ or (2) the mode is not ‘truncate’ or ‘append’
RuntimeError – Raised if a server-side error is thrown saving the pdarray
See also
Notes
Creates one file per locale containing that locale’s chunk of each pdarray. If columns is a dictionary, the keys are used as the HDF5 dataset names. Otherwise, if no names are supplied, 0-up integers are used. By default, any existing files at path_prefix will be overwritten, unless the user specifies the ‘append’ mode, in which case arkouda will attempt to add <columns> as new datasets to existing files. If the wrong number of files is present or dataset names already exist, a RuntimeError is raised.
Examples
>>> a = ak.arange(25)
>>> b = ak.arange(25)
>>> # Save with mapping defining dataset names
>>> ak.to_hdf({'a': a, 'b': b}, 'path/name_prefix')
>>> # Save using names instead of mapping
>>> ak.to_hdf([a, b], 'path/name_prefix', names=['a', 'b'])
- arkouda.to_parquet(columns: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView] | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView], prefix_path: str, names: List[str] = None, mode: str = 'truncate', compression: str | None = None, convert_categoricals: bool = False) None#
Save multiple named pdarrays to Parquet files.
- Parameters:
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
mode ({'truncate' | 'append'}) – By default, truncate (overwrite) the output files if they exist. If ‘append’, attempt to create new dataset in existing files. ‘append’ is deprecated, please use the multi-column write
compression (str (Optional)) – Default None. Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4
convert_categoricals (bool) – Defaults to False. Parquet requires all columns to be the same size and Categoricals don't satisfy that requirement. If set, write the equivalent Strings in place of any Categorical columns.
- Return type:
None
- Raises:
ValueError – Raised if (1) the lengths of columns and values differ or (2) the mode is not ‘truncate’ or ‘append’
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
Creates one file per locale containing that locale’s chunk of each pdarray. If columns is a dictionary, the keys are used as the Parquet column names. Otherwise, if no names are supplied, 0-up integers are used. By default, any existing files at path_prefix will be overwritten, unless the user specifies the ‘append’ mode, in which case arkouda will attempt to add <columns> as new datasets to existing files. If the wrong number of files is present or dataset names already exist, a RuntimeError is raised.
Examples
>>> a = ak.arange(25)
>>> b = ak.arange(25)
>>> # Save with mapping defining dataset names
>>> ak.to_parquet({'a': a, 'b': b}, 'path/name_prefix')
>>> # Save using names instead of mapping
>>> ak.to_parquet([a, b], 'path/name_prefix', names=['a', 'b'])
- arkouda.to_csv(columns: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings] | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], prefix_path: str, names: List[str] = None, col_delim: str = ',', overwrite: bool = False)#
Write Arkouda object(s) to CSV file(s). All CSV Files written by Arkouda include a header denoting data types of the columns.
- Parameters:
columns (Mapping[str, pdarray] or List[pdarray]) – The objects to be written to CSV file. If a mapping is used and names is None the keys of the mapping will be used as the dataset names.
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
names (List[str] (Optional)) – names of dataset to be written. Order should correspond to the order of data provided in columns.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
None
- Raises:
ValueError – Raised if any datasets are present in all csv files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
See also
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
Unlike other file formats, CSV files store Strings in their UTF-8 format instead of storing bytes as uint8.
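No Examples section is given for to_csv; a minimal sketch, assuming a running Arkouda server and illustrative paths:

```python
import arkouda as ak

ak.connect()  # assumes a running Arkouda server
a = ak.arange(5)
s = ak.array(['a', 'b', 'c', 'd', 'e'])

# Mapping keys become the dataset (column) names; one file per locale is written
ak.to_csv({'nums': a, 'strs': s}, 'path/name_prefix')

# Equivalent call using names, overwriting any existing files at the prefix
ak.to_csv([a, s], 'path/name_prefix', names=['nums', 'strs'], overwrite=True)
```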
- arkouda.save_all(columns: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView] | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView], prefix_path: str, names: List[str] = None, file_format='HDF5', mode: str = 'truncate', file_type: str = 'distribute', compression: str | None = None) None#
DEPRECATED Save multiple named pdarrays to HDF5/Parquet files.
- Parameters:
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
file_format (str) – 'HDF5' or 'Parquet'. Defaults to HDF5
mode ({'truncate' | 'append'}) – By default, truncate (overwrite) the output files if they exist. If 'append', attempt to create new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: 'distribute'. 'single' writes the dataset to a single file; 'distribute' writes the dataset to a file per locale. Only used with HDF5.
compression (str (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4")) – Optional. Select the compression to use with Parquet files. Only used with Parquet.
- Return type:
None
- Raises:
ValueError – Raised if (1) the lengths of columns and values differ or (2) the mode is not ‘truncate’ or ‘append’
See also
save, load_all, to_parquet, to_hdf
Notes
Creates one file per locale containing that locale’s chunk of each pdarray. If columns is a dictionary, the keys are used as the HDF5 dataset names. Otherwise, if no names are supplied, 0-up integers are used. By default, any existing files at path_prefix will be overwritten, unless the user specifies the ‘append’ mode, in which case arkouda will attempt to add <columns> as new datasets to existing files. If the wrong number of files is present or dataset names already exist, a RuntimeError is raised.
Examples
>>> a = ak.arange(25)
>>> b = ak.arange(25)
>>> # Save with mapping defining dataset names
>>> ak.save_all({'a': a, 'b': b}, 'path/name_prefix', file_format='Parquet')
>>> # Save using names instead of mapping
>>> ak.save_all([a, b], 'path/name_prefix', names=['a', 'b'], file_format='Parquet')
- arkouda.load(path_prefix: str, file_format: str = 'INFER', dataset: str = 'array', calc_string_offsets: bool = False, column_delim: str = ',') arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index | Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index]#
Load a pdarray previously saved with pdarray.save().
- Parameters:
path_prefix (str) – Filename prefix used to save the original pdarray
file_format (str) – ‘INFER’, ‘HDF5’ or ‘Parquet’. Defaults to ‘INFER’. Used to indicate the file type being loaded. If INFER, this will be detected during processing
dataset (str) – Dataset name where the pdarray was saved, defaults to ‘array’
calc_string_offsets (bool) – If True the server will ignore Segmented Strings ‘offsets’ array and derive it from the null-byte terminators. Defaults to False currently
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
- Returns:
The pdarray or Strings that was previously saved
- Return type:
- Raises:
TypeError – Raised if either path_prefix or dataset is not a str
ValueError – Raised if invalid file_format or if the dataset is not present in all hdf5 files or if the path_prefix does not correspond to files accessible to Arkouda
RuntimeError – Raised if the hdf5 files are present but there is an error in opening one or more of them
See also
Notes
If you have a previously saved Parquet file that is raising a FileNotFound error, try loading it with a .parquet appended to the prefix_path. Parquet files were previously ALWAYS stored with a .parquet extension.
ak.load does not support loading a single file. For loading single HDF5 files without the _LOCALE#### suffix, please use ak.read().
CSV files without the Arkouda Header are not supported.
Examples
>>> # Loading from file without extension
>>> obj = ak.load('path/prefix')
Loads the array from numLocales files with the name ``cwd/path/name_prefix_LOCALE####``. The file type is inferred during processing.
>>> # Loading with an extension (HDF5)
>>> obj = ak.load('path/prefix.test')
Loads the object from numLocales files with the name ``cwd/path/name_prefix_LOCALE####.test``, where #### is replaced by each locale's number. Because the file type is inferred during processing, the extension is not required to be a specific format.
- arkouda.load_all(path_prefix: str, file_format: str = 'INFER', column_delim: str = ',', read_nested=True) Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.categorical.Categorical]#
Load multiple pdarrays, Strings, SegArrays, or Categoricals previously saved with save_all().
- Parameters:
path_prefix (str) – Filename prefix used to save the original pdarray
file_format (str) – 'INFER', 'HDF5', 'Parquet', or 'CSV'. Defaults to 'INFER'. Indicates the format being loaded. When 'INFER', the processing will detect the format.
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Parquet files only
- Returns:
Dictionary of {datasetName: Union[pdarray, Strings, SegArray, Categorical]} with the previously saved pdarrays, Strings, SegArrays, or Categoricals
- Return type:
Mapping[str, Union[pdarray, Strings, SegArray, Categorical]]
- Raises:
TypeError: – Raised if path_prefix is not a str
ValueError – Raised if file_format/extension is encountered that is not hdf5 or parquet or if all datasets are not present in all hdf5/parquet files or if the path_prefix does not correspond to files accessible to Arkouda
RuntimeError – Raised if the hdf5 files are present but there is an error in opening one or more of them
See also
to_parquet, to_hdf, load, read
Notes
This function has been updated to determine the file extension based on the file format variable
This function will be deprecated when glob flags are added to read_* methods
CSV files without the Arkouda Header are not supported.
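No Examples section is given for load_all; a minimal sketch, assuming a running Arkouda server and an illustrative prefix path:

```python
import arkouda as ak

ak.connect()  # assumes a running Arkouda server
ak.to_hdf({'a': ak.arange(10), 'b': ak.arange(10)}, 'path/name_prefix')

# Returns a dict keyed by dataset name; file format is inferred by default
data = ak.load_all('path/name_prefix')
data = ak.load_all('path/name_prefix', file_format='HDF5')
```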
- arkouda.update_hdf(columns: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView] | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.array_view.ArrayView], prefix_path: str, names: List[str] = None, repack: bool = True)#
Overwrite the datasets whose names appear in names, or the keys of columns if columns is a dictionary
- Parameters:
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
repack (bool) – Default: True. HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to False will yield better performance, but will cause file sizes to expand.
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the datasets
Notes
If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the datasets provided do not exist, they will be added
Because HDF5 deletes do not release memory, this will create a copy of the file with the new data
This workflow is slightly different from to_hdf to prevent reading and creating a copy of the file for each dataset
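No Examples section is given for update_hdf; a minimal sketch, assuming a running Arkouda server and an illustrative prefix path:

```python
import arkouda as ak

ak.connect()  # assumes a running Arkouda server
ak.to_hdf({'a': ak.arange(10)}, 'path/name_prefix')

# Overwrite dataset 'a' in the existing files; repack=True reclaims the
# space HDF5 would otherwise leave occupied by the overwritten data
ak.update_hdf({'a': ak.arange(10, 20)}, 'path/name_prefix', repack=True)
```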
- arkouda.snapshot(filename)#
Create a snapshot of the current Arkouda namespace. All currently accessible variables containing Arkouda objects will be written to an HDF5 file.
Unlike other save/load functions, this maintains the integrity of dataframes.
Current variable names are used as the dataset names when saving.
- Parameters:
filename (str) – Name to use when storing the file
- Return type:
None
See also
ak.restore
- arkouda.restore(filename)#
Return data saved using ak.snapshot
- Parameters:
filename (str) – Name used to create the snapshot to be read
- Return type:
Dict
Notes
Unlike other save/load methods, snapshot/restore saves DataFrames alongside other objects in HDF5. Thus, they are returned within the dictionary as DataFrames.
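A minimal sketch of the snapshot/restore round trip, assuming a running Arkouda server; the snapshot name is illustrative:

```python
import arkouda as ak

ak.connect()  # assumes a running Arkouda server
a = ak.arange(10)  # the variable name 'a' becomes the dataset name

ak.snapshot('my_snapshot')        # write all accessible Arkouda objects to HDF5
data = ak.restore('my_snapshot')  # dict mapping variable names to restored objects
```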
- arkouda.receive(hostname: str, port)#
Receive a pdarray sent by pdarray.transfer().
- Parameters:
hostname (str) – The hostname of the pdarray that sent the array
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, used in succession, covering the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, if port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to pdarray.transfer().
- Returns:
The pdarray sent from the sending server to the current receiving server.
- Return type:
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- arkouda.receive_dataframe(hostname: str, port)#
Receive a pdarray sent by dataframe.transfer().
- Parameters:
hostname (str) – The hostname of the dataframe that sent the array
port (int_scalars) – The port to send the dataframe over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, used in succession, covering the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, if port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to pdarray.send_array().
- Returns:
The dataframe sent from the sending server to the current receiving server.
- Return type:
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- arkouda.attach(name: str)#
- arkouda.unregister(name: str)#
- arkouda.attach_all(names: list)#
Attach to all objects registered with the names provided
- Parameters:
names (list) – List of names to attach to
- Return type:
dict
- arkouda.unregister_all(names: list)#
Unregister all names provided
- Parameters:
names (list) – List of names used to register objects to be unregistered
- Return type:
None
- arkouda.register_all(data: dict)#
Register all objects in the provided dictionary
- Parameters:
data (dict) – Maps the name to register the object under to the object itself. For example, {"MyArray": ak.array([0, 1, 2])}
- Return type:
None
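The registry helpers above work together; a minimal sketch, assuming a running Arkouda server (object names are illustrative):

```python
import arkouda as ak

ak.connect()  # assumes a running Arkouda server

# Register two objects under names that persist on the server
ak.register_all({'MyArray': ak.arange(5), 'MyStrings': ak.array(['a', 'b'])})

# Later, possibly from a new client session: reattach, then clean up
objs = ak.attach_all(['MyArray', 'MyStrings'])
ak.unregister_all(['MyArray', 'MyStrings'])
```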
- arkouda.is_registered(name: str, as_component: bool = False) bool#
Determine if the name provided is associated with a registered Object
- Parameters:
name (str) – The name to check for in the registry
as_component (bool) – Default: False When True, the name will be checked to determine if it is registered as a component of a registered object
- Return type:
bool